Lemma 1

Under the Plackett-Luce, and in particular the Bradley-Terry, preference framework, two reward functions from the same class induce the same preference distribution. — Rafailov et al. (2024), p. 5

The Plackett-Luce preference framework, of which the Bradley-Terry model is a special case, models the probability of preferring a specific item (e.g., a response $y$) among several options. In this framework, preference probabilities depend only on differences in reward values. For instance, in the Bradley-Terry model, the probability that response $y_1$ is preferred over $y_2$ for a given prompt $x$ is calculated as follows:

$$
p(y_1 \succ y_2 \mid x) = \frac{\exp\bigl(r(x, y_1)\bigr)}{\exp\bigl(r(x, y_1)\bigr) + \exp\bigl(r(x, y_2)\bigr)} = \sigma\bigl(r(x, y_1) - r(x, y_2)\bigr),
$$

where $\sigma$ is the logistic (sigmoid) function.
This equation shows that the larger the reward difference $r(x, y_1) - r(x, y_2)$ between two responses $y_1$ and $y_2$, the higher the probability that $y_1$ will be preferred.
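The equation above can be evaluated either from the two exponentiated rewards directly or as a sigmoid of the reward difference. A minimal sketch checking that both forms agree (the function names and reward values are illustrative, not from the paper):

```python
import math

def bt_prob_ratio(r1: float, r2: float) -> float:
    """Bradley-Terry preference probability via exponentiated rewards."""
    e1, e2 = math.exp(r1), math.exp(r2)
    return e1 / (e1 + e2)

def bt_prob_sigmoid(r1: float, r2: float) -> float:
    """The same probability as a logistic function of the reward difference."""
    return 1.0 / (1.0 + math.exp(-(r1 - r2)))

# Example rewards: r(x, y1) = 2.0, r(x, y2) = 0.5
p = bt_prob_ratio(2.0, 0.5)
assert abs(p - bt_prob_sigmoid(2.0, 0.5)) < 1e-12
```

The sigmoid form makes explicit that only the difference $r(x, y_1) - r(x, y_2)$ matters, which is the property the lemma exploits.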

According to the definition in Rafailov et al. (2024), two reward functions $r$ and $r'$ are equivalent if they differ only by a function of the prompt; that is, $r'$ can be expressed as:

$$
r'(x, y) = r(x, y) + f(x)
$$

for some function $f$ that depends only on the prompt $x$.
When using two equivalent reward functions $r$ and $r'$, the difference in reward values between two responses $y_1$ and $y_2$ is calculated as:

$$
r'(x, y_1) - r'(x, y_2) = \bigl(r(x, y_1) + f(x)\bigr) - \bigl(r(x, y_2) + f(x)\bigr) = r(x, y_1) - r(x, y_2).
$$
The prompt-dependent term $f(x)$ cancels, so the reward difference between the two responses is the same whether it is computed with $r$ or with $r'$.

Since preference probabilities in the Plackett-Luce framework are determined entirely by these reward differences, reward functions belonging to the same equivalence class induce identical preference probability distributions.
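The cancellation argument can be checked numerically: adding a prompt-dependent constant $f(x)$ to every reward leaves the Plackett-Luce (softmax) choice probabilities over several responses unchanged, because the factor $\exp(f(x))$ cancels from numerator and denominator. A minimal sketch, where the reward values and the shift are arbitrary illustrations:

```python
import math

def softmax(rewards):
    """Plackett-Luce first-choice probabilities: softmax over reward values."""
    exps = [math.exp(r) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]

# Arbitrary rewards r(x, y_k) for K = 3 responses to the same prompt x
rewards = [1.2, -0.3, 0.7]

# An equivalent reward r'(x, y) = r(x, y) + f(x), with f(x) = 5.0 for this prompt
f_x = 5.0
shifted = [r + f_x for r in rewards]

p = softmax(rewards)
p_shifted = softmax(shifted)
assert all(abs(a - b) < 1e-12 for a, b in zip(p, p_shifted))
```

The same cancellation holds for the full Plackett-Luce ranking probability, since each factor in its product is itself a softmax over the remaining responses.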

See also

Rafailov et al. (2024), p. 17

Reference

Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2024). Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. arXiv. https://doi.org/10.48550/arXiv.2305.18290