TL;DR

DPO (Direct Preference Optimization) uses a clever trick to train an optimal policy without a separate reward model: the reward function is never modeled explicitly, but is instead expressed implicitly in terms of the policy itself.

How DPO Enables Learning Without a Reward Model

DPO reformulates the RLHF objective so that the same learning signal is available without an explicit reward model. Using a change of variables, it expresses the reward function solely in terms of three components:

  • Optimal policy: The policy capable of generating responses we prefer.
  • Reference policy: The existing baseline policy.
  • Partition function: The normalization constant over all possible responses, which is typically intractable to compute.
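Written out, the change of variables from Rafailov et al. (2024) takes the following form, where β is the coefficient of the KL penalty against the reference policy:

```latex
% Reward expressed through the optimal policy (change of variables):
% pi_r = optimal policy, pi_ref = reference policy, Z(x) = partition function
r(x, y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
          + \beta \log Z(x)
```

Note that the intractable term β log Z(x) depends only on the prompt x, not on the response y, which is what lets DPO sidestep computing it.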

Why This is Important

This approach is crucial because it lets us optimize a policy using only preference data, with no need to train or query an explicit reward model. Because the partition function depends only on the prompt, it cancels when two responses to the same prompt are compared under the Bradley-Terry preference model, so it never has to be computed. This makes the entire pipeline significantly simpler and more efficient.
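The cancellation above can be made concrete in a few lines. The sketch below (a minimal illustration, not the paper's implementation; in practice the log-probabilities would come from summing token log-probs of a language model) computes the DPO loss for a single preference pair from sequence-level log-probabilities:

```python
import math

def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    Each response's implicit reward is beta * (log pi_theta - log pi_ref);
    the beta * log Z(x) term is identical for both responses to the same
    prompt, so it cancels in the margin and never needs to be computed.
    """
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # Negative log-sigmoid of the reward margin (Bradley-Terry likelihood)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference, the margin is zero and the loss is log 2; as the policy shifts probability mass toward the chosen response relative to the reference, the loss decreases.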

(Rafailov et al., 2024, p. 4)

Reference

Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2024). Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. arXiv. https://doi.org/10.48550/arXiv.2305.18290