1. Generate Responses With a Fine-tuned Model

A fine-tuned model, $\pi^{\text{SFT}}$, is given a prompt $x$ to generate two distinct responses $y_1$ and $y_2$.

2. Human Labeling (Preference Collection)

A human labeler evaluates the two responses and indicates which one is better, resulting in a preferred response $y_w$ (winner) and a dispreferred response $y_l$ (loser). These collected preferences are the data used to train the reward model to assign preference scores.
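As a concrete illustration, a single collected preference can be stored as a simple record; the field names and example strings below are hypothetical, not from the paper:

```python
# One preference datapoint (x, y_w, y_l) as collected from a human labeler.
# Field names and contents are illustrative assumptions.
preference_example = {
    "prompt": "Explain photosynthesis in one sentence.",          # x
    "chosen": "Photosynthesis converts sunlight, water, and CO2 "
              "into chemical energy stored as sugar.",            # y_w (winner)
    "rejected": "Plants eat sunlight.",                           # y_l (loser)
}

print(sorted(preference_example.keys()))  # ['chosen', 'prompt', 'rejected']
```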

3. Reward Model

There are various methods for modeling preferences, with the Bradley-Terry (BT) model being the most popular.

$$p^*(y_1 \succ y_2 \mid x) = \frac{\exp\left(r^*(x, y_1)\right)}{\exp\left(r^*(x, y_1)\right) + \exp\left(r^*(x, y_2)\right)}$$

This equation gives the probability that $y_1$ is preferred over $y_2$ given prompt $x$, based on their respective reward scores $r^*(x, y_1)$ and $r^*(x, y_2)$.
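The BT probability above can be sketched in plain Python; the function name and scalar reward inputs are illustrative assumptions, not from the paper:

```python
import math

def bt_preference_prob(r1: float, r2: float) -> float:
    """Bradley-Terry probability that response 1 beats response 2,
    given scalar reward scores r1 = r(x, y1) and r2 = r(x, y2)."""
    # The exp form mirrors the BT equation; it is algebraically
    # equivalent to sigmoid(r1 - r2), so only the reward gap matters.
    return math.exp(r1) / (math.exp(r1) + math.exp(r2))

# Equal rewards give a 50/50 preference; a higher reward shifts
# probability mass toward that response.
print(bt_preference_prob(1.0, 1.0))  # → 0.5
```

Because only the difference $r_1 - r_2$ enters the probability, rewards are identifiable only up to a shared additive constant.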

4. Maximum Likelihood Estimation (MLE)

  1. MLE is used to train the reward model $r_\phi$.
  2. This can be framed as a binary classification problem (classifying which response is preferred).
  3. The training involves minimizing the negative log-likelihood:

     $$\mathcal{L}_R(r_\phi, \mathcal{D}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]$$

     This loss maximizes the probability of assigning a higher reward to the preferred response ($y_w$) than to the dispreferred response ($y_l$), where $\sigma$ is the logistic function and $\mathcal{D} = \{(x^{(i)}, y_w^{(i)}, y_l^{(i)})\}_{i=1}^N$ is the collected preference dataset.

(Rafailov et al., 2024, p. 3)
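The MLE objective above can be sketched as a small Python function over a batch of reward pairs; the function names and toy numbers are illustrative assumptions, not the paper's implementation:

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function sigma(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def reward_model_loss(pairs: list) -> float:
    """Negative log-likelihood L_R averaged over a batch of (r_w, r_l)
    pairs, where r_w = r_phi(x, y_w) scores the preferred response and
    r_l = r_phi(x, y_l) the dispreferred one."""
    return -sum(math.log(sigmoid(r_w - r_l)) for r_w, r_l in pairs) / len(pairs)

# A zero reward margin gives the chance-level loss log 2 ~= 0.693;
# scoring the preferred response higher drives the loss toward 0.
print(round(reward_model_loss([(0.0, 0.0)]), 3))  # → 0.693
print(round(reward_model_loss([(2.0, 0.0)]), 3))  # → 0.127
```

Minimizing this loss by gradient descent over the dataset $\mathcal{D}$ yields the trained reward model $r_\phi$.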

Reference

Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2024). Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. arXiv. https://doi.org/10.48550/arXiv.2305.18290