Understanding The Equivalence Between Two Reward Models in DPO

Definition 1

We say that two reward functions $r (x, y)$ and $r^{'} (x, y)$ are equivalent iff $r (x, y) - r^{'} (x, y) = f (x)$ for some function $f$ . — Rafailov et al. (2024), p. 5

According to the definition, two reward functions $r_{1}$ and $r_{2}$ are considered equivalent if their difference depends only on the prompt $x$ and not on the response $y$. This difference is expressed as a function $f(x)$.

For example, if a given prompt $x$ has five possible responses $y_{1}, y_{2}, \dots, y_{5}$, the difference between two equivalent reward functions $r_{1}$ and $r_{2}$ is the same for every response:

$$
\begin{aligned}
r_{1}(x, y_{1}) - r_{2}(x, y_{1}) &= f(x) \\
r_{1}(x, y_{2}) - r_{2}(x, y_{2}) &= f(x) \\
&\;\;\vdots \\
r_{1}(x, y_{5}) - r_{2}(x, y_{5}) &= f(x)
\end{aligned}
$$

This means that, for a fixed prompt $x$, the two reward functions differ by the same value $f(x)$ regardless of which response is generated. The offset may change from one prompt to another, but it never changes across responses to the same prompt.
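The check can be sketched numerically. The snippet below is a minimal illustration, not code from the paper: the reward values, the offset `f_x`, and the helper `is_equivalent_on_prompt` are all made up for this note. It builds $r_{1}$ from $r_{2}$ by adding a prompt-only offset, then verifies that the difference is the same for all five responses.

```python
# Hypothetical toy rewards for a single prompt x and five responses y_1..y_5.
r2 = [0.3, -1.2, 2.0, 0.7, -0.4]  # r_2(x, y_i), made-up values
f_x = 1.5                         # f(x): an offset that depends only on the prompt

# Build r_1 so that r_1(x, y_i) = r_2(x, y_i) + f(x) for every response.
r1 = [r + f_x for r in r2]


def is_equivalent_on_prompt(r_a, r_b, tol=1e-9):
    """Return True if r_a - r_b takes the same value for every response."""
    diffs = [a - b for a, b in zip(r_a, r_b)]
    return max(diffs) - min(diffs) <= tol


print(is_equivalent_on_prompt(r1, r2))  # True: the difference is f(x) everywhere

# A shift that varies with the response breaks the equivalence.
r3 = [r + 0.1 * i for i, r in enumerate(r2)]
print(is_equivalent_on_prompt(r3, r2))  # False: the difference depends on y
```

The helper only inspects one prompt at a time. Under Definition 1, the same check would have to pass for every prompt, with $f(x)$ free to take a different value for each one.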

Reference

Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2024). Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. arXiv. https://doi.org/10.48550/arXiv.2305.18290
