- Bradley-Terry Preference Model:
- Optimal Policy in RLHF:
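These two starting points can be written out explicitly; the following is the standard statement from Rafailov et al. (2024), using the paper's notation (r for the reward, π_ref for the reference policy, β for the KL coefficient, Z(x) for the partition function):

```latex
% (a) Bradley-Terry model of human preferences
p^*(y_1 \succ y_2 \mid x)
  = \frac{\exp\!\big(r^*(x, y_1)\big)}
         {\exp\!\big(r^*(x, y_1)\big) + \exp\!\big(r^*(x, y_2)\big)}

% (b) Optimal policy of the KL-constrained RLHF objective
\pi_r(y \mid x)
  = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)
    \exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big),
\qquad
Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x)
       \exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big)
```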
DPO Method
By taking the logarithm of both sides of equation (b), we can express the reward in terms of the optimal policy π_r, the reference policy π_ref, and the partition function Z(x):
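In the paper's notation, this reparameterization reads:

```latex
% (c) Reward expressed through the optimal policy (log of both sides of (b))
r(x, y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
          + \beta \log Z(x)
```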
The reparameterization (c) holds in particular for the ground-truth reward r* and its optimal policy π*. Although Z(x) is intractable, the Bradley-Terry (BT) model fortunately computes a preference using only the difference between two rewards.
When we substitute the reparameterized reward from (c) into the Bradley-Terry model in equation (a), the partition function Z(x) conveniently cancels out:
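Concretely, the substitution gives the preference probability purely in terms of policies, which in turn defines the DPO training objective (notation follows the paper: σ is the logistic function, y_w and y_l are the preferred and dispreferred completions in dataset D):

```latex
p^*(y_1 \succ y_2 \mid x)
  = \sigma\!\Big(
      \beta \log \frac{\pi^*(y_1 \mid x)}{\pi_{\mathrm{ref}}(y_1 \mid x)}
      - \beta \log \frac{\pi^*(y_2 \mid x)}{\pi_{\mathrm{ref}}(y_2 \mid x)}
    \Big)

\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \Big[ \log \sigma\!\Big(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \Big) \Big]
```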
Ultimately, this means that DPO can derive the preference-optimal policy directly from preference data, without fitting a separate reward model.
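The cancellation of Z(x) makes the per-example loss trivially computable from log-probabilities alone. Below is a minimal, framework-free sketch of the loss for a single preference pair; the function name and arguments are illustrative, not from the paper:

```python
import math


def dpo_loss(pi_logp_w: float, pi_logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigma(beta * log-ratio margin).

    pi_logp_w / pi_logp_l   : policy log-probs of the chosen (w) and rejected (l) completions
    ref_logp_w / ref_logp_l : reference-policy log-probs of the same completions
    """
    # Beta-scaled difference of log-ratios; Z(x) has already cancelled.
    margin = beta * ((pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l))
    # -log(sigmoid(margin))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference, the margin is 0 and the loss is log 2; as the policy puts relatively more mass on the chosen completion, the loss decreases.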
Bonus
This method is not limited to the Bradley-Terry model; it can be applied more generally to any Plackett-Luce model.
(Rafailov et al., 2024, pp. 4–5)