1. Bradley-Terry Preference Model:
2. Optimal Policy in RLHF:
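For reference, the two premises above can be written out as in Rafailov et al. (2024); the labels (a) and (b) are used to match the derivation below:

```latex
% (a) Bradley-Terry model: preference probability induced by a latent reward r*
p^{*}(y_1 \succ y_2 \mid x)
  = \frac{\exp\bigl(r^{*}(x, y_1)\bigr)}
         {\exp\bigl(r^{*}(x, y_1)\bigr) + \exp\bigl(r^{*}(x, y_2)\bigr)}
\tag{a}

% (b) Optimal policy of the KL-regularized RLHF objective with strength beta
\pi_{r}(y \mid x)
  = \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)
    \exp\!\Bigl(\tfrac{1}{\beta}\, r(x, y)\Bigr),
\qquad
Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x)
       \exp\!\Bigl(\tfrac{1}{\beta}\, r(x, y)\Bigr)
\tag{b}
```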

DPO Method

By taking the logarithm of both sides of equation (b), we can express the reward model r in terms of the optimal policy π_r, the reference policy π_ref, and the partition function Z(x):
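Rearranged for r, equation (b) becomes:

```latex
% (c) Reward expressed in terms of the policy it induces
r(x, y)
  = \beta \log \frac{\pi_{r}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
    + \beta \log Z(x)
\tag{c}
```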

The relationship in (c) also holds for the ground-truth reward function r* and the corresponding optimal policy π*. Although the partition function Z(x) is intractable to compute, the Bradley-Terry (BT) model fortunately computes the preference probability using only the difference between the rewards of the two responses.

When we apply the result from (c) to equation (a), the partition function Z(x) conveniently cancels out:
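Substituting (c) into (a) for both responses, the β log Z(x) terms subtract away, leaving a preference probability defined purely by log-probability ratios (σ denotes the logistic function):

```latex
p^{*}(y_1 \succ y_2 \mid x)
  = \sigma\!\Bigl(
      \beta \log \frac{\pi^{*}(y_1 \mid x)}{\pi_{\mathrm{ref}}(y_1 \mid x)}
    - \beta \log \frac{\pi^{*}(y_2 \mid x)}{\pi_{\mathrm{ref}}(y_2 \mid x)}
    \Bigr)
```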

Ultimately, this means that DPO allows us to directly derive the optimal policy that reflects preferences without the need for a separate reward function.
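The resulting per-pair loss can be sketched in plain Python. This is a minimal illustration, not the paper's implementation; the function name, the default β, and the toy log-probabilities below are assumptions for demonstration:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * margin of log-ratios).

    logp_w / logp_l         : policy log-probs of the chosen / rejected response
    ref_logp_w / ref_logp_l : reference-policy log-probs of the same responses
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Numerically stable softplus(-margin) == -log(sigmoid(margin))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

Minimizing this loss pushes the policy's log-ratio for the chosen response above that of the rejected one, which is exactly the preference probability above; no explicit reward model is ever fit.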

Bonus

This method is not limited to the Bradley-Terry model; it extends to the more general Plackett-Luce family of ranking models, of which Bradley-Terry is a special case.
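For K ranked responses, the Plackett-Luce model assigns a probability to a ranking τ (Bradley-Terry is the K = 2 case). Written with the same reward r*, the rewards again enter only through softmax ratios, so Z(x) cancels just as before:

```latex
p^{*}\bigl(\tau \mid y_1, \ldots, y_K, x\bigr)
  = \prod_{k=1}^{K}
    \frac{\exp\bigl(r^{*}(x, y_{\tau(k)})\bigr)}
         {\sum_{j=k}^{K} \exp\bigl(r^{*}(x, y_{\tau(j)})\bigr)}
```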

(Rafailov et al., 2024, pp. 4–5)

Reference

Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2024). Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. arXiv. https://doi.org/10.48550/arXiv.2305.18290