Direct Preference Optimization: Your Language Model Is Secretly a Reward Model - 2023
Details
Title: Direct Preference Optimization: Your Language Model Is Secretly a Reward Model
Author(s): Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Ermon, Stefano and Manning, Christopher D. and Finn, Chelsea
Link(s): http://arxiv.org/abs/2305.18290
Rough Notes
Recall Reinforcement Learning From Human Feedback (RLHF):
- Collect a large text dataset.
- Train a language model on this dataset with the usual next-token prediction objective (pre-training and/or supervised fine-tuning).
- Collect data for reward modelling: for each prompt \(x_i\), sample (at least two) answers \(y_{ij}\) from the model and gather preferences \(p_{ij}\) over those answers, typically from human annotators.
- Train a reward model that assigns a scalar score to each answer \(y_{ij}\). For pairwise (binary) comparisons, the training objective maximizes the probability that the preferred answer is scored above the rejected one (see the objectives written out after this list).
- Do RL, treating the language model as the agent, prompts as states, and generated answers as actions, with rewards given by the reward model. The optimization objective also includes a regularizer that keeps the new policy close to the old (frozen) reference model's policy, which also helps prevent reward hacking. PPO is the RL method commonly used here (see the KL-regularized objective after this list).
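For concreteness, the last two steps can be written out as in the paper (where \(y_w\) is the preferred and \(y_l\) the rejected answer, \(\sigma\) is the sigmoid, \(\mathcal{D}\) is the preference dataset, and \(\pi_{\text{ref}}\) is the frozen reference policy):
\[
\mathcal{L}_R(r_\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\right]
\]
\[
\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\left[r_\phi(x, y)\right] - \beta\, \mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta(\cdot \mid x)\,\|\,\pi_{\text{ref}}(\cdot \mid x)\right]
\]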
However, the RL step comes with its own problems: it requires fitting a separate reward model, sampling from the policy during training, and significant hyperparameter tuning, and it can be unstable.
In this work, the authors introduce Direct Preference Optimization (DPO), which bypasses explicit reward modelling and the RL phase: using the closed-form solution of the KL-regularized objective above, the reward can be expressed in terms of the policy itself, so the probability of the human preference data is written with the (optimal) policy rather than with a reward model. Preference learning then reduces to a simple classification-style loss on the language model, trained with the same maximum-likelihood machinery as the original LLM training objective (see the DPO loss and the code sketch below).
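The resulting DPO loss from the paper is
\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
\]
A minimal PyTorch-style sketch of this loss, assuming the per-sequence log-probabilities \(\log \pi(y \mid x)\) (summed over answer tokens) have already been computed; the function and variable names here are hypothetical, not the authors' implementation:
```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss from per-sequence log-probs log pi(y|x), one entry per preference pair."""
    # Implicit rewards are the (scaled) log-ratios between the policy and the frozen reference.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # Binary classification of which answer is preferred, under a Bradley-Terry-style model.
    losses = -F.logsigmoid(beta * (chosen_logratios - rejected_logratios))
    return losses.mean()
```
The reference log-probabilities come from the frozen reference (e.g. SFT) model and receive no gradient, so only the policy's parameters are updated.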