Bradley-Terry (BT) Model
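The equation these notes discuss did not survive in the text. The form below is a reconstruction following the cited paper (Rafailov et al., 2024, p. 3). The symbols are assumed from that paper: $x$ is the prompt, $y_1$ and $y_2$ are the two candidate responses, and $r^*$ is the latent reward function.

$$
p^*(y_1 \succ y_2 \mid x) = \frac{\exp\left(r^*(x, y_1)\right)}{\exp\left(r^*(x, y_1)\right) + \exp\left(r^*(x, y_2)\right)}
$$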
Interpretation
This equation models the probability that people prefer one response ($y_1$) over another ($y_2$) when given a specific prompt ($x$).
- A latent (unknown) reward function $r^*(x, y)$: This function quantifies “how good” a response $y$ is for a given prompt $x$. This underlying “goodness” is what the model tries to estimate.
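- Exponential transformation: The function $\exp(\cdot)$ maps each reward, which can be any real number, to a strictly positive value. This is important because the normalized ratio of these values must be a valid probability, and probabilities cannot be negative.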
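- Softmax: The overall structure of the equation is a softmax over the two responses. It takes the exponentially transformed reward values and normalizes them so that each preference probability lies between 0 and 1 and the two sum to 1. For two responses this is equivalent to the logistic sigmoid of the reward difference, $\sigma(r^*(x, y_1) - r^*(x, y_2))$, so only the difference between the rewards matters, not their absolute values.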
(Rafailov et al., 2024, p. 3)
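The interpretation above can be checked numerically. This is a minimal sketch, not code from the paper: it assumes the two rewards are already available as plain floats, and the function names `bt_preference_prob` and `sigmoid` are hypothetical. It computes the two-way softmax directly and compares it with the sigmoid of the reward difference.

```python
import math


def bt_preference_prob(reward_1: float, reward_2: float) -> float:
    """Bradley-Terry probability that response 1 is preferred over response 2.

    reward_1 and reward_2 stand in for r(x, y1) and r(x, y2); how they are
    obtained is outside this sketch.
    """
    # Shift both rewards by their maximum before exponentiating. The shift
    # cancels in the ratio and avoids overflow for large rewards.
    shift = max(reward_1, reward_2)
    e1 = math.exp(reward_1 - shift)
    e2 = math.exp(reward_2 - shift)
    return e1 / (e1 + e2)


def sigmoid(z: float) -> float:
    """Logistic sigmoid (simple form, intended for moderate values of z)."""
    return 1.0 / (1.0 + math.exp(-z))


p_12 = bt_preference_prob(2.0, 1.0)  # about 0.731
p_21 = bt_preference_prob(1.0, 2.0)  # about 0.269

# Softmax over two rewards equals the sigmoid of their difference.
assert abs(p_12 - sigmoid(2.0 - 1.0)) < 1e-12
# The two preference probabilities sum to 1.
assert abs(p_12 + p_21 - 1.0) < 1e-12
# Equal rewards give no preference either way.
assert bt_preference_prob(0.0, 0.0) == 0.5
```

Adding the same constant to both rewards leaves the result unchanged, which is why the reward function is only identifiable up to a shift under this model.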
Reference
Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2024). Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. arXiv. https://doi.org/10.48550/arXiv.2305.18290