1. Generate Responses With a Fine-tuned Model

A fine-tuned model, $\pi^{\text{SFT}}$, is given a prompt $x$ to generate two distinct responses $y_1$ and $y_2$.

2. Human Labeling (Preference Collection)

A human labeler evaluates the two responses and indicates which one is better, resulting in a preferred response $y_w$ (winner) and a dispreferred response $y_l$ (loser). These collected preferences are the data used to train the reward model to assign preference scores.
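As a concrete illustration, a single collected preference can be stored as a simple record; the field names and example strings below are hypothetical, not from the paper:

```python
# One preference datapoint (x, y_w, y_l) as collected from a human labeler.
# Field names and contents are illustrative assumptions.
preference_example = {
    "prompt": "Explain photosynthesis in one sentence.",          # x
    "chosen": "Photosynthesis converts sunlight, water, and CO2 "
              "into chemical energy stored as sugar.",            # y_w (winner)
    "rejected": "Plants eat sunlight.",                           # y_l (loser)
}

print(sorted(preference_example.keys()))  # ['chosen', 'prompt', 'rejected']
```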

3. Reward Model

There are various methods for modeling preferences, with the Bradley-Terry (BT) model being the most popular.

$$p^*(y_1 \succ y_2 \mid x) = \frac{\exp\left(r^*(x, y_1)\right)}{\exp\left(r^*(x, y_1)\right) + \exp\left(r^*(x, y_2)\right)}$$

This equation gives the probability that $y_1$ is preferred over $y_2$ given prompt $x$, based on their respective reward scores $r^*(x, y_1)$ and $r^*(x, y_2)$.
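The BT probability above can be sketched in plain Python; the function name and scalar reward inputs are illustrative assumptions, not from the paper:

```python
import math

def bt_preference_prob(r1: float, r2: float) -> float:
    """Bradley-Terry probability that response 1 beats response 2,
    given scalar reward scores r1 = r(x, y1) and r2 = r(x, y2)."""
    # The exp form mirrors the BT equation; it is algebraically
    # equivalent to sigmoid(r1 - r2), so only the reward gap matters.
    return math.exp(r1) / (math.exp(r1) + math.exp(r2))

# Equal rewards give a 50/50 preference; a higher reward shifts
# probability mass toward that response.
print(bt_preference_prob(1.0, 1.0))  # → 0.5
```

Because only the difference $r_1 - r_2$ enters the probability, rewards are identifiable only up to a shared additive constant.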

4. Maximum Likelihood Estimation (MLE)

  1. MLE is used to train the reward model $r_\phi$.
  2. This can be framed as a binary classification problem (classifying which response is preferred).
  3. The training involves minimizing the negative log-likelihood:

     $$\mathcal{L}_R(r_\phi, \mathcal{D}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]$$

     This loss maximizes the probability of assigning a higher reward to the preferred response ($y_w$) than to the dispreferred response ($y_l$), where $\sigma$ is the logistic function and $\mathcal{D} = \{(x^{(i)}, y_w^{(i)}, y_l^{(i)})\}_{i=1}^N$ is the collected preference dataset.

(Rafailov et al., 2024, p. 3)
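The MLE objective above can be sketched as a small Python function over a batch of reward pairs; the function names and toy numbers are illustrative assumptions, not the paper's implementation:

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function sigma(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def reward_model_loss(pairs: list) -> float:
    """Negative log-likelihood L_R averaged over a batch of (r_w, r_l)
    pairs, where r_w = r_phi(x, y_w) scores the preferred response and
    r_l = r_phi(x, y_l) the dispreferred one."""
    return -sum(math.log(sigmoid(r_w - r_l)) for r_w, r_l in pairs) / len(pairs)

# A zero reward margin gives the chance-level loss log 2 ~= 0.693;
# scoring the preferred response higher drives the loss toward 0.
print(round(reward_model_loss([(0.0, 0.0)]), 3))  # → 0.693
print(round(reward_model_loss([(2.0, 0.0)]), 3))  # → 0.127
```

Minimizing this loss by gradient descent over the dataset $\mathcal{D}$ yields the trained reward model $r_\phi$.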

Reference

Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2024). Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. arXiv. https://doi.org/10.48550/arXiv.2305.18290