Optimization Objective
$$
\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)} \big[ r_\phi(x, y) \big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}} \big[ \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \big]
$$

- $\mathcal{D}$: The distribution of the given data. It ensures that new prompts encountered by the model are consistent with the data it was trained on.
- $\pi_\theta$: The LM (or policy) currently undergoing reinforcement learning. It is typically initialized from a Supervised Fine-Tuned (SFT) model, often denoted $\pi^{\mathrm{SFT}}$.
- $\pi_{\mathrm{ref}}$: The reference policy, typically the SFT model $\pi^{\mathrm{SFT}}$ itself. It serves as a baseline that prevents the fine-tuned model from drifting too far from its original capabilities.
- $x \sim \mathcal{D}$: The prompt $x$ follows the given data distribution, so new input prompts won't be drastically different from what the model has already encountered.
- $y \sim \pi_\theta(y \mid x)$: The response $y$ is generated by the LM currently being reinforced, $\pi_\theta$.
- $r_\phi(x, y)$: The reward model, which scores how good a response $y$ is for a given prompt $x$. This model is trained to align with human preferences.
- $\mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x)\big]$: The Kullback-Leibler (KL) divergence term. It measures how much the responses generated by the current RL-trained LM ($\pi_\theta$) deviate from those of the reference LM ($\pi_{\mathrm{ref}}$). This term is crucial: if $\pi_\theta$ focuses too heavily on maximizing the learned preferences (i.e., overfits the reward model), it can lose its general response-generation capabilities (e.g., factual accuracy).
- $\beta$: A hyperparameter that controls the strength of the penalty applied when $\pi_\theta$ diverges from $\pi_{\mathrm{ref}}$. A higher $\beta$ means a stronger penalty, keeping $\pi_\theta$ closer to $\pi_{\mathrm{ref}}$.
Interpretation
The overall goal is to maximize the average preference score (reward) of generated responses. At the same time, the KL penalty constrains the model so that it does not sacrifice general response quality by over-optimizing against the learned preference patterns. This ensures that the fine-tuned LM produces responses that are both preferred by humans and of high quality.
(Rafailov et al., 2024, pp. 3–4)