1. Generate Responses With a Fine-tuned Model
A fine-tuned model, $\pi^{\mathrm{SFT}}$, is given a prompt $x$ to generate two distinct responses $y_1$ and $y_2$.
2. Human Labeling (Preference Collection)
A human labeler evaluates the two responses and indicates which one is better, yielding a preferred response $y_w$ (winner) and a dispreferred response $y_l$ (loser). These preference labels are the training signal the reward model later uses to assign preference scores.
3. Reward Model
There are various methods for modeling preferences, with the Bradley-Terry (BT) model being the most popular.
$$p^*(y_1 \succ y_2 \mid x) = \frac{\exp\big(r^*(x, y_1)\big)}{\exp\big(r^*(x, y_1)\big) + \exp\big(r^*(x, y_2)\big)}$$

This equation gives the probability that $y_1$ is preferred over $y_2$ given prompt $x$, based on their respective reward scores $r^*(x, y_1)$ and $r^*(x, y_2)$.
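The BT probability can be sketched in a few lines of plain Python; `bt_preference_prob` is a hypothetical helper, with `r1` and `r2` standing in for the reward scores $r^*(x, y_1)$ and $r^*(x, y_2)$:

```python
import math

def bt_preference_prob(r1: float, r2: float) -> float:
    """Bradley-Terry probability that response 1 is preferred over
    response 2, given reward scores r1 = r*(x, y1) and r2 = r*(x, y2)."""
    # A softmax over the two rewards; algebraically equal to sigmoid(r1 - r2).
    return math.exp(r1) / (math.exp(r1) + math.exp(r2))

print(bt_preference_prob(1.0, 1.0))  # equal rewards -> 0.5, no preference
```

Note that the two probabilities for a pair always sum to 1, and a larger reward gap pushes the preference probability toward 1.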
4. Maximum Likelihood Estimation (MLE)
- MLE is used to train the reward model
- This can be framed as a binary classification problem (classifying which response is preferred).
- The training involves minimizing the negative log-likelihood:
$$\mathcal{L}_R(r_\phi, \mathcal{D}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\right]$$
where $\sigma$ is the logistic function. This loss maximizes the probability of assigning a higher reward to the preferred response ($y_w$) than to the dispreferred response ($y_l$) over the collected dataset $\mathcal{D}$.
(Rafailov et al., 2024, p. 3)
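The MLE objective above can be sketched directly; `reward_model_nll` is a hypothetical helper that takes precomputed reward scores rather than a real model, averaging $-\log \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))$ over labeled pairs:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def reward_model_nll(pairs):
    """Average negative log-likelihood over a preference dataset.
    `pairs` is a list of (reward_winner, reward_loser) tuples,
    i.e. (r_phi(x, y_w), r_phi(x, y_l)) for each labeled comparison."""
    return -sum(math.log(sigmoid(rw - rl)) for rw, rl in pairs) / len(pairs)

# The loss shrinks as the model separates winners from losers more widely.
narrow_margin = reward_model_nll([(1.0, 0.9)])
wide_margin = reward_model_nll([(3.0, -3.0)])
assert wide_margin < narrow_margin
```

In practice the reward scores would come from a trainable network and this loss would be minimized by gradient descent; the pure-Python version just makes the binary-classification framing concrete.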
Reference
Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2024). Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. arXiv. https://doi.org/10.48550/arXiv.2305.18290