Lemma 1

Under the Plackett-Luce, and in particular the Bradley-Terry, preference framework, two reward functions from the same class induce the same preference distribution. — Rafailov et al. (2024), p. 5

The Plackett-Luce preference framework, of which the Bradley-Terry model is a special case, models the probability of preferring a specific item (e.g., a response $y$) among several options. In this framework, preference probabilities depend only on differences in reward values. For instance, in the Bradley-Terry model, the probability that response $y_1$ is preferred over $y_2$ for a given prompt $x$ is calculated as follows:

$$
p(y_1 \succ y_2 \mid x) = \frac{\exp\bigl(r(x, y_1)\bigr)}{\exp\bigl(r(x, y_1)\bigr) + \exp\bigl(r(x, y_2)\bigr)} = \sigma\bigl(r(x, y_1) - r(x, y_2)\bigr),
$$

where $\sigma$ is the logistic (sigmoid) function.
This equation shows that the larger the reward difference $r(x, y_1) - r(x, y_2)$ between two responses $y_1$ and $y_2$, the higher the probability that $y_1$ will be preferred.
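The equation above can be evaluated either from the two exponentiated rewards directly or as a sigmoid of the reward difference. A minimal sketch checking that both forms agree (the function names and reward values are illustrative, not from the paper):

```python
import math

def bt_prob_ratio(r1: float, r2: float) -> float:
    """Bradley-Terry preference probability via exponentiated rewards."""
    e1, e2 = math.exp(r1), math.exp(r2)
    return e1 / (e1 + e2)

def bt_prob_sigmoid(r1: float, r2: float) -> float:
    """The same probability as a logistic function of the reward difference."""
    return 1.0 / (1.0 + math.exp(-(r1 - r2)))

# Example rewards: r(x, y1) = 2.0, r(x, y2) = 0.5
p = bt_prob_ratio(2.0, 0.5)
assert abs(p - bt_prob_sigmoid(2.0, 0.5)) < 1e-12
```

The sigmoid form makes explicit that only the difference $r(x, y_1) - r(x, y_2)$ matters, which is the property the lemma exploits.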

According to the definition in Rafailov et al. (2024), two reward functions $r$ and $r'$ are equivalent if they differ only by a function of the prompt; that is, $r'$ can be expressed as:

$$
r'(x, y) = r(x, y) + f(x)
$$

for some function $f$ that depends only on the prompt $x$.
When using two equivalent reward functions $r$ and $r'$, the difference in reward values between two responses $y_1$ and $y_2$ is calculated as:

$$
r'(x, y_1) - r'(x, y_2) = \bigl(r(x, y_1) + f(x)\bigr) - \bigl(r(x, y_2) + f(x)\bigr) = r(x, y_1) - r(x, y_2).
$$
The prompt-dependent term $f(x)$ cancels, so the reward difference between the two responses is the same whether it is computed with $r$ or with $r'$.

Since preference probabilities in the Plackett-Luce framework are determined entirely by these reward differences, reward functions belonging to the same equivalence class induce identical preference probability distributions.
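The cancellation argument can be checked numerically: adding a prompt-dependent constant $f(x)$ to every reward leaves the Plackett-Luce (softmax) choice probabilities over several responses unchanged, because the factor $\exp(f(x))$ cancels from numerator and denominator. A minimal sketch, where the reward values and the shift are arbitrary illustrations:

```python
import math

def softmax(rewards):
    """Plackett-Luce first-choice probabilities: softmax over reward values."""
    exps = [math.exp(r) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]

# Arbitrary rewards r(x, y_k) for K = 3 responses to the same prompt x
rewards = [1.2, -0.3, 0.7]

# An equivalent reward r'(x, y) = r(x, y) + f(x), with f(x) = 5.0 for this prompt
f_x = 5.0
shifted = [r + f_x for r in rewards]

p = softmax(rewards)
p_shifted = softmax(shifted)
assert all(abs(a - b) < 1e-12 for a, b in zip(p, p_shifted))
```

The same cancellation holds for the full Plackett-Luce ranking probability, since each factor in its product is itself a softmax over the remaining responses.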

See also

Rafailov et al. (2024), p. 17

Reference

Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2024). Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. arXiv. https://doi.org/10.48550/arXiv.2305.18290