TL;DR
DPO employs a clever trick to train an optimal policy without needing a separate reward model. The core idea is to express the reward function implicitly, in terms of the policy itself.
How DPO Enables Learning Without a Reward Model
DPO reformulates the RLHF objective so that the same learning signal can be obtained without directly using a reward model. Through a change of variables, the reward function is expressed solely in terms of three components:
- Optimal policy: The policy that maximizes expected reward while staying close (in KL divergence) to the reference policy.
- Reference policy: The existing baseline policy.
- Partition function: The normalization constant, which is typically intractable to compute.
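In the paper's notation, the change of variables writes the reward in terms of its optimal policy, and substituting it into the Bradley–Terry preference model makes the partition function cancel, leaving the DPO loss:

```latex
% Reward expressed via its optimal policy (change of variables):
%   r(x, y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)
% Z(x) is identical for both responses to the same prompt x, so it cancels
% in the preference probability, yielding the DPO objective:
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```

Here $y_w$ and $y_l$ are the preferred and dispreferred responses, $\sigma$ is the logistic function, and $\beta$ controls how far the policy may drift from the reference.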
Why This is Important
This approach is crucial because the intractable partition function cancels when comparing two responses to the same prompt, so the policy can be optimized on preference data alone, without explicitly training or querying a reward model. This makes the entire process significantly simpler and more efficient.
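As a minimal sketch of this cancellation, the per-example loss can be computed from four summed log-probabilities. The function name and argument names here are illustrative, not from the paper:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example DPO loss from summed token log-probabilities.

    logp_w / logp_l: policy log-probs of the preferred / dispreferred response.
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference policy.
    """
    # Implicit rewards: beta * log(pi_theta / pi_ref). The beta * log Z(x)
    # term is identical for both responses, so it drops out of the margin.
    reward_w = beta * (logp_w - ref_logp_w)
    reward_l = beta * (logp_l - ref_logp_l)
    margin = reward_w - reward_l
    # Negative log-sigmoid of the reward margin (Bradley-Terry likelihood).
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When both policies agree (zero margin) the loss is log 2; it falls below that as soon as the policy assigns relatively more mass to the preferred response than the reference does.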
(Rafailov et al., 2024, p. 4)
Reference
Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2024). Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. arXiv. https://doi.org/10.48550/arXiv.2305.18290