Tldr
While the formula for the optimal policy is known, the normalization constant (partition function) makes its computation prohibitively complex, making it impractical for real-world application.
Partition Function
The optimal solution for the RLHF objective takes the following form:
Here, represents the partition function (or normalization constant).
- Calculating is extremely computationally expensive because it requires summing over all possible responses .
- Even with relatively efficient methods like Maximum Likelihood Estimation (MLE), the computation of remains costly.
- Therefore, while an optimal policy theoretically exists, the significant computational burden makes its direct application challenging in practice.
(Rafailov et al., 2024, p. 4)
Reference
Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2024). Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. arXiv. https://doi.org/10.48550/arXiv.2305.18290