Lemma 2

Two reward functions from the same equivalence class induce the same optimal policy under the constrained RL problem. — Rafailov et al. (2024), p. 5

According to Definition 1 (Rafailov et al., 2024, p. 5), two reward functions r(x, y) and r'(x, y) belong to the same equivalence class if they differ only by a function of the prompt alone, f(x). In simpler terms, r'(x, y) = r(x, y) + f(x).

Let’s look at Eq. 4 (Rafailov et al., 2024, p. 4), which gives the optimal policy of the KL-constrained reward-maximization problem:

π_r(y | x) = (1 / Z(x)) · π_ref(y | x) · exp(r(x, y) / β),

where Z(x) = Σ_y π_ref(y | x) · exp(r(x, y) / β) is the partition function and β controls the strength of the KL constraint toward the reference policy π_ref.

When we substitute r(x, y) with r(x, y) + f(x), the factor exp(f(x) / β) appears in both the numerator and the partition function Z(x), so it cancels by the properties of exponential functions, and we recover the same policy: π_{r+f} = π_r (Rafailov et al., 2024, p. 18).
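The cancellation can be written out explicitly as a one-line expansion of Eq. 4 under the substitution:

```latex
\pi_{r+f}(y \mid x)
= \frac{\pi_{\mathrm{ref}}(y \mid x)\,
        \exp\!\big(\tfrac{1}{\beta}\,[\,r(x, y) + f(x)\,]\big)}
       {\sum_{y'} \pi_{\mathrm{ref}}(y' \mid x)\,
        \exp\!\big(\tfrac{1}{\beta}\,[\,r(x, y') + f(x)\,]\big)}
= \frac{\exp\!\big(\tfrac{f(x)}{\beta}\big)\,
        \pi_{\mathrm{ref}}(y \mid x)\,\exp\!\big(\tfrac{r(x, y)}{\beta}\big)}
       {\exp\!\big(\tfrac{f(x)}{\beta}\big)\,
        \sum_{y'} \pi_{\mathrm{ref}}(y' \mid x)\,\exp\!\big(\tfrac{r(x, y')}{\beta}\big)}
= \pi_{r}(y \mid x).
```

Since f(x) does not depend on the answer y, it factors out of both the numerator and the sum over y' and cancels exactly.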

This means that even if equivalent reward functions have different absolute reward values, their relative reward relationships between answers remain identical. Consequently, they will lead to the exact same optimal policy.
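This invariance is easy to verify numerically. The sketch below builds the Eq. 4 policy from a reference distribution and a reward vector, then shifts every reward by a constant f(x); the concrete values (five candidate answers, the shift of 1.3, β = 0.5) are arbitrary illustration choices, not from the paper:

```python
import numpy as np

def optimal_policy(pi_ref, rewards, beta):
    """Optimal policy of the KL-constrained problem (Eq. 4):
    pi(y|x) proportional to pi_ref(y|x) * exp(r(x, y) / beta)."""
    unnorm = pi_ref * np.exp(rewards / beta)
    return unnorm / unnorm.sum()  # dividing by the sum plays the role of Z(x)

rng = np.random.default_rng(0)
pi_ref = rng.dirichlet(np.ones(5))  # reference policy over 5 candidate answers
r = rng.normal(size=5)              # reward for each answer (arbitrary values)
f_x = 1.3                           # prompt-only shift f(x) (arbitrary value)
beta = 0.5

p1 = optimal_policy(pi_ref, r, beta)
p2 = optimal_policy(pi_ref, r + f_x, beta)
print(np.allclose(p1, p2))  # True: the exp(f(x)/beta) factor cancels in Z(x)
```

The absolute reward values differ between the two calls, but because exp((r + f)/β) = exp(f/β) · exp(r/β) and the exp(f/β) factor also multiplies the normalizer, both reward functions induce the identical policy.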

Reference

Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2024). Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. arXiv. https://doi.org/10.48550/arXiv.2305.18290