Lemma 2

Two reward functions from the same equivalence class induce the same optimal policy under the constrained RL problem. — Rafailov et al. (2024), p. 5

According to Definition 1 (Rafailov et al., 2024, p. 5), two reward functions r(x, y) and r'(x, y) belong to the same equivalence class if they differ only by a function of the prompt alone, f(x). In simpler terms, r'(x, y) = r(x, y) + f(x).

Let’s look at Eq. 4 (Rafailov et al., 2024, p. 4), which gives the optimal policy of the KL-constrained reward-maximization problem:

π_r(y | x) = (1 / Z(x)) · π_ref(y | x) · exp(r(x, y) / β),

where Z(x) = Σ_y π_ref(y | x) · exp(r(x, y) / β) is the partition function and β controls the strength of the KL constraint toward the reference policy π_ref.

When we substitute r(x, y) with r(x, y) + f(x), the factor exp(f(x) / β) appears in both the numerator and the partition function Z(x), so it cancels by the properties of exponential functions, and we recover the same policy: π_{r+f} = π_r (Rafailov et al., 2024, p. 18).
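The cancellation can be written out explicitly as a one-line expansion of Eq. 4 under the substitution:

```latex
\pi_{r+f}(y \mid x)
= \frac{\pi_{\mathrm{ref}}(y \mid x)\,
        \exp\!\big(\tfrac{1}{\beta}\,[\,r(x, y) + f(x)\,]\big)}
       {\sum_{y'} \pi_{\mathrm{ref}}(y' \mid x)\,
        \exp\!\big(\tfrac{1}{\beta}\,[\,r(x, y') + f(x)\,]\big)}
= \frac{\exp\!\big(\tfrac{f(x)}{\beta}\big)\,
        \pi_{\mathrm{ref}}(y \mid x)\,\exp\!\big(\tfrac{r(x, y)}{\beta}\big)}
       {\exp\!\big(\tfrac{f(x)}{\beta}\big)\,
        \sum_{y'} \pi_{\mathrm{ref}}(y' \mid x)\,\exp\!\big(\tfrac{r(x, y')}{\beta}\big)}
= \pi_{r}(y \mid x).
```

Since f(x) does not depend on the answer y, it factors out of both the numerator and the sum over y' and cancels exactly.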

This means that even if equivalent reward functions have different absolute reward values, their relative reward relationships between answers remain identical. Consequently, they will lead to the exact same optimal policy.
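This invariance is easy to verify numerically. The sketch below builds the Eq. 4 policy from a reference distribution and a reward vector, then shifts every reward by a constant f(x); the concrete values (five candidate answers, the shift of 1.3, β = 0.5) are arbitrary illustration choices, not from the paper:

```python
import numpy as np

def optimal_policy(pi_ref, rewards, beta):
    """Optimal policy of the KL-constrained problem (Eq. 4):
    pi(y|x) proportional to pi_ref(y|x) * exp(r(x, y) / beta)."""
    unnorm = pi_ref * np.exp(rewards / beta)
    return unnorm / unnorm.sum()  # dividing by the sum plays the role of Z(x)

rng = np.random.default_rng(0)
pi_ref = rng.dirichlet(np.ones(5))  # reference policy over 5 candidate answers
r = rng.normal(size=5)              # reward for each answer (arbitrary values)
f_x = 1.3                           # prompt-only shift f(x) (arbitrary value)
beta = 0.5

p1 = optimal_policy(pi_ref, r, beta)
p2 = optimal_policy(pi_ref, r + f_x, beta)
print(np.allclose(p1, p2))  # True: the exp(f(x)/beta) factor cancels in Z(x)
```

The absolute reward values differ between the two calls, but because exp((r + f)/β) = exp(f/β) · exp(r/β) and the exp(f/β) factor also multiplies the normalizer, both reward functions induce the identical policy.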

Reference

Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2024). Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. arXiv. https://doi.org/10.48550/arXiv.2305.18290