Wei Xiong@UIUC
Hanze Dong@Salesforce
Rui Yang@HKUST
Date: Mar 23, 2024
To Readers:
- Leave a comment or open an issue if you have any questions, and enjoy building your own reward models!
TL;DR
This is the recipe for the GitHub repo used to train the reward model for RLHF.
- 8 x A40 48G: we can train Gemma-7B-it/Mistral-7B-inst-v0.2 with max_length 4096 using DeepSpeed ZeRO-3 + gradient checkpointing;
- 4 x A100 80G: we can train Gemma-7B-it/Mistral-7B-inst-v0.2 with max_length 4096 using gradient checkpointing (a minimal setup sketch follows this list);
- The resulting reward models achieve SOTA performance among reward models with base models ≤ 13B on the RewardBench leaderboard. They also outperform all existing DPO reward models. (Mar 23, 2024)
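For reference, here is a minimal sketch (not the repo's exact training script) of how such a setup can be expressed with HuggingFace Transformers: the model is loaded as a single-label sequence classifier that outputs a scalar reward, gradient checkpointing is turned on in `TrainingArguments`, and ZeRO-3 is enabled by passing a DeepSpeed config to the launcher. The model names are real; the file names, batch sizes, and config path are illustrative assumptions.

```python
# Minimal sketch, not the repo's exact script: a 7B reward model set up as a
# single-label sequence classifier, with gradient checkpointing enabled.
# ZeRO-3 is activated at launch time, e.g.
#   deepspeed train_rm.py --deepspeed ds_zero3.json   # hypothetical file names
import torch
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
)

model_name = "mistralai/Mistral-7B-Instruct-v0.2"   # or "google/gemma-7b-it"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token            # Mistral ships without a pad token

# The reward model is the LM backbone plus a scalar head (num_labels=1).
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=1,
    torch_dtype=torch.bfloat16,
)
model.config.pad_token_id = tokenizer.pad_token_id

training_args = TrainingArguments(
    output_dir="rm_checkpoints",          # illustrative
    per_device_train_batch_size=1,        # illustrative; tune to your GPUs
    gradient_accumulation_steps=16,       # illustrative
    gradient_checkpointing=True,          # needed to fit max_length 4096 on 48G/80G cards
    bf16=True,
)
```

Pairwise preference data is then typically scored with this classifier and optimized with a Bradley-Terry loss on the chosen/rejected reward gap, as described in the rest of this recipe.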
1. Introduction
Reinforcement learning from human feedback (RLHF) is a leading technique for adapting the generation distribution to human preferences, and it has achieved tremendous success in ChatGPT by OpenAI, Claude by Anthropic, and Gemini by Google.
The standard workflow, presented in the InstructGPT paper, consists of three steps:
- Preference data collection;