Wei Xiong@UIUC
Hanze Dong@Salesforce
Rui Yang@HKUST
Date: Mar 23, 2024
To Readers:
- Leave a comment or open an issue if you have any questions, and enjoy building your own reward models!
TL;DR
This is the recipe for the GitHub repo used to train the reward model for RLHF.
- 8 x A40 48G: we can train Gemma-7B-it/Mistral-7B-inst-v0.2 with max_length 4096 using DeepSpeed ZeRO-3 + gradient checkpointing;
- 4 x A100 80G: we can train Gemma-7B-it/Mistral-7B-inst-v0.2 with max_length 4096 using gradient checkpointing (a minimal setup sketch follows this list);
- The resulting reward models achieve SOTA performance among reward models with base models ≤ 13B on the RewardBench leaderboard. They also outperform all existing DPO reward models. (Mar 23, 2024)
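For reference, here is a minimal sketch (not the repo's exact training script) of how such a setup can be expressed with HuggingFace Transformers: the model is loaded as a single-label sequence classifier that outputs a scalar reward, gradient checkpointing is turned on in `TrainingArguments`, and ZeRO-3 is enabled by passing a DeepSpeed config to the launcher. The model names are real; the file names, batch sizes, and config path are illustrative assumptions.

```python
# Minimal sketch, not the repo's exact script: a 7B reward model set up as a
# single-label sequence classifier, with gradient checkpointing enabled.
# ZeRO-3 is activated at launch time, e.g.
#   deepspeed train_rm.py --deepspeed ds_zero3.json   # hypothetical file names
import torch
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
)

model_name = "mistralai/Mistral-7B-Instruct-v0.2"   # or "google/gemma-7b-it"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token            # Mistral ships without a pad token

# The reward model is the LM backbone plus a scalar head (num_labels=1).
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=1,
    torch_dtype=torch.bfloat16,
)
model.config.pad_token_id = tokenizer.pad_token_id

training_args = TrainingArguments(
    output_dir="rm_checkpoints",          # illustrative
    per_device_train_batch_size=1,        # illustrative; tune to your GPUs
    gradient_accumulation_steps=16,       # illustrative
    gradient_checkpointing=True,          # needed to fit max_length 4096 on 48G/80G cards
    bf16=True,
)
```

Pairwise preference data is then typically scored with this classifier and optimized with a Bradley-Terry loss on the chosen/rejected reward gap, as described in the rest of this recipe.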
1. Introduction
Reinforcement learning from human feedback (RLHF) is a leading technique for adapting the generation distribution to human preferences, and it has achieved tremendous success in ChatGPT by OpenAI, Claude by Anthropic, and Gemini by Google.
The standard workflow, presented in the InstructGPT paper, consists of three steps:
- Preference data collection;