The Partition Function—Making RLHF Computations Difficult

TL;DR

The optimal policy for the RLHF objective has a known closed form, but its normalization constant (the partition function) is prohibitively expensive to compute, so the closed form cannot be used directly in practice.

Partition Function

The optimal solution for the RLHF objective takes the following form:

$$\pi_{r}(y \mid x) = \frac{1}{Z(x)} \, \pi_{\text{ref}}(y \mid x) \exp\left(\frac{1}{\beta} r(x, y)\right).$$

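Here, $Z(x) = \sum_{y} \pi_{\text{ref}}(y \mid x) \exp\left(\frac{1}{\beta} r(x, y)\right)$ is the partition function (normalization constant), which ensures that the policy sums to 1 over all responses $y$.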

Calculating $Z(x)$ exactly requires summing over every possible response $y$, and the number of candidate responses grows exponentially with response length.
Even if the true reward is replaced by its maximum likelihood estimate (MLE), estimating $Z(x)$ remains expensive.
Therefore, although the optimal policy is known in closed form, this computational cost makes it hard to use directly in practice.

(Rafailov et al., 2024, p. 4)
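
To make the cost concrete, here is a minimal sketch that computes the partition function by brute force on a toy problem and then counts how many terms a realistic problem would need. Everything in it is an assumption chosen for illustration, not taken from the paper: the three-token vocabulary, the uniform reference policy, the reward that counts occurrences of the token "a", the restriction to responses of one fixed length, and the vocabulary size of 50,000 with 100-token responses.

```python
import itertools
import math


def partition_function(vocab, length, ref_prob, reward, beta):
    """Brute-force Z(x): sum pi_ref(y|x) * exp(r(x, y) / beta) over all y.

    Simplification: only responses of exactly `length` tokens are enumerated.
    """
    total = 0.0
    for y in itertools.product(vocab, repeat=length):
        total += ref_prob(y) * math.exp(reward(y) / beta)
    return total


# Hypothetical toy setup (the prompt x is held fixed and left implicit).
vocab = ["a", "b", "c"]
length = 4


def ref_prob(y):
    # Uniform reference policy over all sequences of this length.
    return (1.0 / len(vocab)) ** len(y)


def reward(y):
    # Toy reward: number of times the token "a" appears.
    return float(y.count("a"))


z = partition_function(vocab, length, ref_prob, reward, beta=1.0)
toy_terms = len(vocab) ** length  # number of responses summed in the toy case

# Assumed realistic scale: 50,000-token vocabulary, 100-token responses.
realistic_terms = 50_000 ** 100
realistic_digits = len(str(realistic_terms))

print(f"Z(x) on the toy problem: {z:.6f} ({toy_terms} terms)")
print(f"Terms at realistic scale: a {realistic_digits}-digit number")
```

The toy sum has only 81 terms, but the same enumeration at the assumed realistic scale would need a number of terms with hundreds of digits, which is why exact computation of $Z(x)$ is out of reach.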

Reference

Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2024). Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. arXiv. https://doi.org/10.48550/arXiv.2305.18290
