Reinforcement Learning from Human Feedback (RLHF) requires two main stages, both of which come with significant costs:
- Training a Reward Model: First, you need to train a separate reward model that accurately reflects human preferences.
- Optimizing the Language Model with RL: Then, you use this learned reward model to optimize the language model itself with a reinforcement learning algorithm such as Proximal Policy Optimization (PPO).
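The first stage above can be sketched concretely. A reward model is typically trained on pairs of responses where humans preferred one over the other, using a Bradley-Terry-style pairwise loss. The following is a minimal sketch of that objective, assuming the model already produces scalar reward scores for each response; the function name and example values are illustrative, not from the paper:

```python
import math

def pairwise_preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).

    The loss is small when the reward model scores the human-preferred
    response higher than the rejected one, and large otherwise.
    """
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Illustrative reward scores for (chosen, rejected) response pairs.
pairs = [(2.0, -1.0), (0.5, 0.4), (-1.0, 1.0)]
losses = [pairwise_preference_loss(rc, rr) for rc, rr in pairs]
```

A confident correct ranking (2.0 vs -1.0) yields a near-zero loss, while a wrong ranking (-1.0 vs 1.0) is heavily penalized, which is what pushes the reward model toward reflecting human preferences.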
As highlighted in Rafailov et al. (2024), applying reinforcement learning to large-scale models like LLMs presents several challenges:
- High Computational Cost: PPO-based fine-tuning requires sampling generations, scoring them with the reward model, and updating the policy, making the process extremely resource-intensive.
- Training Instability: The learning process is often unstable and sensitive to hyperparameters, making it difficult to manage.
(Rafailov et al., 2024, p. 4)
Reference
Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2024). Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. arXiv. https://doi.org/10.48550/arXiv.2305.18290