Optimization Objective
$$
\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)} \big[ r_\phi(x, y) \big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}} \big[ \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \big]
$$

- $\mathcal{D}$: The distribution of the given data. It ensures that new prompts encountered by the model are consistent with the data it was trained on.
- $\pi_\theta$: The LM (or policy) currently undergoing reinforcement learning. It is typically initialized from a Supervised Fine-Tuned (SFT) model, often denoted $\pi^{\mathrm{SFT}}$.
- $\pi_{\mathrm{ref}}$: The reference policy, typically the SFT model $\pi^{\mathrm{SFT}}$ itself. It serves as a baseline that prevents the fine-tuned model from drifting too far from its original capabilities.
- $x \sim \mathcal{D}$: The prompt $x$ follows the given data distribution, so new input prompts won't be drastically different from what the model has already encountered.
- $y \sim \pi_\theta(y \mid x)$: The response $y$ is generated by the LM currently being reinforced, $\pi_\theta$.
- $r_\phi(x, y)$: The reward model, which scores how good a response $y$ is for a given prompt $x$. This model is trained to align with human preferences.
- $\mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x)\big]$: The Kullback-Leibler (KL) divergence term. It measures how much the responses generated by the current RL-trained LM ($\pi_\theta$) deviate from those of the reference LM ($\pi_{\mathrm{ref}}$). This term is crucial: if $\pi_\theta$ focuses too heavily on maximizing the learned preferences (i.e., overfits the reward model), it can lose its general response-generation capabilities (e.g., factual accuracy).
- $\beta$: A hyperparameter that controls the strength of the penalty applied when $\pi_\theta$ diverges from $\pi_{\mathrm{ref}}$. A higher $\beta$ means a stronger penalty, keeping $\pi_\theta$ closer to $\pi_{\mathrm{ref}}$.
Interpretation
The overall goal is to maximize the average preference score (reward) of generated responses. At the same time, the KL penalty constrains the model so that it does not sacrifice general response quality by over-optimizing against the learned preference patterns. This ensures that the fine-tuned LM produces responses that are both preferred by humans and of high quality.
(Rafailov et al., 2024, pp. 3–4)