1. Bradley-Terry Preference Model:
2. Optimal Policy in RLHF:
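For reference, the two premises above can be written out as in Rafailov et al. (2024); the labels (a) and (b) are used to match the derivation below:

```latex
% (a) Bradley-Terry model: preference probability induced by a latent reward r*
p^{*}(y_1 \succ y_2 \mid x)
  = \frac{\exp\bigl(r^{*}(x, y_1)\bigr)}
         {\exp\bigl(r^{*}(x, y_1)\bigr) + \exp\bigl(r^{*}(x, y_2)\bigr)}
\tag{a}

% (b) Optimal policy of the KL-regularized RLHF objective with strength beta
\pi_{r}(y \mid x)
  = \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)
    \exp\!\Bigl(\tfrac{1}{\beta}\, r(x, y)\Bigr),
\qquad
Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x)
       \exp\!\Bigl(\tfrac{1}{\beta}\, r(x, y)\Bigr)
\tag{b}
```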

DPO Method

By taking the logarithm of both sides of equation (b), we can express the reward model r in terms of the optimal policy π_r, the reference policy π_ref, and the partition function Z(x):
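Rearranged for r, equation (b) becomes:

```latex
% (c) Reward expressed in terms of the policy it induces
r(x, y)
  = \beta \log \frac{\pi_{r}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
    + \beta \log Z(x)
\tag{c}
```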

The relationship in (c) also holds for the ground-truth reward function r* and the corresponding optimal policy π*. Although the partition function Z(x) is intractable to compute, the Bradley-Terry (BT) model fortunately computes the preference probability using only the difference between the rewards of the two responses.

When we apply the result from (c) to equation (a), the partition function Z(x) conveniently cancels out:
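Substituting (c) into (a) for both responses, the β log Z(x) terms subtract away, leaving a preference probability defined purely by log-probability ratios (σ denotes the logistic function):

```latex
p^{*}(y_1 \succ y_2 \mid x)
  = \sigma\!\Bigl(
      \beta \log \frac{\pi^{*}(y_1 \mid x)}{\pi_{\mathrm{ref}}(y_1 \mid x)}
    - \beta \log \frac{\pi^{*}(y_2 \mid x)}{\pi_{\mathrm{ref}}(y_2 \mid x)}
    \Bigr)
```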

Ultimately, this means that DPO allows us to directly derive the optimal policy that reflects preferences without the need for a separate reward function.
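The resulting per-pair loss can be sketched in plain Python. This is a minimal illustration, not the paper's implementation; the function name, the default β, and the toy log-probabilities below are assumptions for demonstration:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * margin of log-ratios).

    logp_w / logp_l         : policy log-probs of the chosen / rejected response
    ref_logp_w / ref_logp_l : reference-policy log-probs of the same responses
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Numerically stable softplus(-margin) == -log(sigmoid(margin))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

Minimizing this loss pushes the policy's log-ratio for the chosen response above that of the rejected one, which is exactly the preference probability above; no explicit reward model is ever fit.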

Bonus

This method is not limited to the Bradley-Terry model; it extends to the more general Plackett-Luce family of ranking models, of which Bradley-Terry is a special case.
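For K ranked responses, the Plackett-Luce model assigns a probability to a ranking τ (Bradley-Terry is the K = 2 case). Written with the same reward r*, the rewards again enter only through softmax ratios, so Z(x) cancels just as before:

```latex
p^{*}\bigl(\tau \mid y_1, \ldots, y_K, x\bigr)
  = \prod_{k=1}^{K}
    \frac{\exp\bigl(r^{*}(x, y_{\tau(k)})\bigr)}
         {\sum_{j=k}^{K} \exp\bigl(r^{*}(x, y_{\tau(j)})\bigr)}
```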

(Rafailov et al., 2024, pp. 4–5)

Reference

Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2024). Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. arXiv. https://doi.org/10.48550/arXiv.2305.18290