**Hanning Zhang, Jiarui Yao, Chenlu Ye, Wei Xiong$^\dagger$, Tong Zhang**$^\dagger$

$\dagger$: Project lead

GitHub Page: https://github.com/RLHFlow/Online-DPO-R1

Date: Feb 16, 2025


TL;DR:

Inspired by the success of Deepseek-R1-Zero, and by several replications of PPO training with a rule-based reward that achieve strong performance on mathematical reasoning and exhibit the emergence of the “Aha moment” during RL training, we are curious about how alternative algorithms from the RLHF literature behave under this framework. In this blog, we implement rule-based RL from Qwen2.5-MATH-7B-base using iterative DPO and reward-ranked fine-tuning (RAFT). We train the models on the prompt sets from the MATH training set and Numina-Math, and evaluate them on AIME24, AMC23, MATH500, Minerva Math, and OlympiadBench. Our findings are as follows:

Figure: illustration of the iterative DPO pipeline. Exploration is implemented via best-of-n versus worst-of-n sampling: we sample n responses and take the response with the highest reward and the response with the lowest reward as a preference pair. The RAFT pipeline is similar, except that only the positive (highest-reward) data is used for fine-tuning.
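To make the pairing rule concrete, here is a minimal sketch of how a DPO preference pair and a RAFT sample could be constructed from n sampled responses. The `generate_responses` and `rule_based_reward` callables are hypothetical placeholders standing in for the actual generation and reward code, not the exact implementation in our repository.

```python
# Minimal sketch: build a DPO preference pair (best-of-n vs. worst-of-n)
# and a RAFT sample (positive data only) from n sampled responses.
# `generate_responses` and `rule_based_reward` are hypothetical placeholders.

def build_training_data(prompt, generate_responses, rule_based_reward, n=8):
    # Sample n candidate responses for the prompt.
    responses = generate_responses(prompt, n)

    # Score each response with the rule-based reward (no neural reward model).
    scored = [(rule_based_reward(prompt, r), r) for r in responses]
    scored.sort(key=lambda x: x[0], reverse=True)

    best_reward, best_resp = scored[0]
    worst_reward, worst_resp = scored[-1]

    # Iterative DPO: use the best and worst responses as (chosen, rejected).
    dpo_pair = None
    if best_reward > worst_reward:  # skip prompts where all rewards tie
        dpo_pair = {"prompt": prompt, "chosen": best_resp, "rejected": worst_resp}

    # RAFT: keep only the positive (highest-reward) response for fine-tuning.
    raft_sample = None
    if best_reward > 0:
        raft_sample = {"prompt": prompt, "response": best_resp}

    return dpo_pair, raft_sample
```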

Experiment Settings

Rule-based Reward

We run all experiments without any neural reward model. Instead, following Deepseek-R1, we use a rule-based reward defined as follows: