Reinforcement Learning from Human Feedback (RLHF) requires two main stages, both of which come with significant costs:
- Training a Reward Model: First, you need to train a separate reward model that accurately reflects human preferences.
- Optimizing the Language Model with RL: Then, you use this learned reward model to optimize the language model itself with a reinforcement learning algorithm such as Proximal Policy Optimization (PPO).
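The first stage above can be sketched concretely. A reward model is typically trained on pairs of responses where humans preferred one over the other, using a Bradley-Terry-style pairwise loss. The following is a minimal sketch of that objective, assuming the model already produces scalar reward scores for each response; the function name and example values are illustrative, not from the paper:

```python
import math

def pairwise_preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).

    The loss is small when the reward model scores the human-preferred
    response higher than the rejected one, and large otherwise.
    """
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Illustrative reward scores for (chosen, rejected) response pairs.
pairs = [(2.0, -1.0), (0.5, 0.4), (-1.0, 1.0)]
losses = [pairwise_preference_loss(rc, rr) for rc, rr in pairs]
```

A confident correct ranking (2.0 vs -1.0) yields a near-zero loss, while a wrong ranking (-1.0 vs 1.0) is heavily penalized, which is what pushes the reward model toward reflecting human preferences.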
As highlighted in Rafailov et al. (2024), applying reinforcement learning to large-scale models like LLMs presents several challenges:
- High Computational Cost: PPO-based fine-tuning requires sampling generations, scoring them with the reward model, and updating the policy, making the process extremely resource-intensive.
- Training Instability: The learning process is often unstable and sensitive to hyperparameters, making it difficult to manage.
(Rafailov et al., 2024, p. 4)
Reference
Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2024). Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. arXiv. https://doi.org/10.48550/arXiv.2305.18290