Bradley-Terry (BT) Model

Interpretation

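$$p^*(y_1 \succ y_2 \mid x) = \frac{\exp(r^*(x, y_1))}{\exp(r^*(x, y_1)) + \exp(r^*(x, y_2))}$$

This equation models the probability that people prefer one response ($y_1$) over another ($y_2$) when given a specific prompt ($x$).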

  • A latent (unknown) reward function $r^*(x, y)$: This function quantifies “how good” a response $y$ is for a given prompt $x$. This underlying “goodness” is never observed directly; it is what the model tries to estimate from preference data.
  • Exponential transformation: The function $\exp(\cdot)$ maps every reward, including negative ones, to a positive value. This matters because the terms being normalized must be positive for the result to be a valid probability.
  • Softmax: The equation is a softmax over the two rewards: each exponentiated reward is divided by the sum of both, which yields a value between 0 and 1. The probabilities of the two possible preferences sum to 1, so the output can be read directly as a preference probability.

(Rafailov et al., 2024, p. 3)
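
The Bradley-Terry probability described above can be sketched in a few lines of Python. This is only an illustration, not code from the cited paper: the function name `bt_preference_probability` and the reward values are hypothetical, and the rewards are assumed to be plain scalars already produced by some reward function. The sketch relies on the identity exp(a) / (exp(a) + exp(b)) = 1 / (1 + exp(-(a - b))), so only the difference between the two rewards matters.

```python
import math


def bt_preference_probability(reward_1: float, reward_2: float) -> float:
    """Probability that response 1 is preferred over response 2 under
    the Bradley-Terry model, given one scalar reward per response."""
    # Only the reward difference matters:
    # exp(r1) / (exp(r1) + exp(r2)) == 1 / (1 + exp(-(r1 - r2)))
    diff = reward_1 - reward_2
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    # For negative differences, use the equivalent form that keeps the
    # argument of exp() non-positive, so it cannot overflow.
    e = math.exp(diff)
    return e / (1.0 + e)


# Hypothetical reward values, chosen only for illustration.
p = bt_preference_probability(2.0, 1.0)
print(round(p, 4))  # 0.7311
```

Equal rewards give a probability of 0.5, and swapping the two arguments gives the complementary probability, which matches the softmax reading of the equation.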

Reference

Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2024). Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. arXiv. https://doi.org/10.48550/arXiv.2305.18290