**Hanning Zhang, Jiarui Yao, Chenlu Ye, Wei Xiong$^\dagger$, Tong Zhang**$^\dagger$

$\dagger$: Project lead

GitHub Page: https://github.com/RLHFlow/Online-DPO-R1

Date: Feb 16, 2025


TL;DR:

Inspired by the success of Deepseek-R1-Zero, and by several replications of PPO training with a rule-based reward that achieve strong performance on mathematical reasoning and exhibit the emergence of the “Aha moment” during RL training, we are curious about how alternative algorithms from the RLHF literature behave under this framework. In this blog, we implement rule-based RL from Qwen2.5-MATH-7B-base using iterative DPO and reward-ranked fine-tuning (RAFT). We train the models on the prompt sets from the MATH training set and Numina-Math, and evaluate them on AIME24, AMC23, MATH500, Minerva Math, and OlympiadBench. Our findings are as follows:

Figure: illustration of the iterative DPO pipeline. Exploration is implemented via best-of-n versus worst-of-n sampling: we sample n responses and take the response with the highest reward and the response with the lowest reward as a preference pair. The RAFT pipeline is similar, except that only the positive (highest-reward) data is used for fine-tuning.
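To make the pairing rule concrete, here is a minimal sketch of how a DPO preference pair and a RAFT sample could be constructed from n sampled responses. The `generate_responses` and `rule_based_reward` callables are hypothetical placeholders standing in for the actual generation and reward code, not the exact implementation in our repository.

```python
# Minimal sketch: build a DPO preference pair (best-of-n vs. worst-of-n)
# and a RAFT sample (positive data only) from n sampled responses.
# `generate_responses` and `rule_based_reward` are hypothetical placeholders.

def build_training_data(prompt, generate_responses, rule_based_reward, n=8):
    # Sample n candidate responses for the prompt.
    responses = generate_responses(prompt, n)

    # Score each response with the rule-based reward (no neural reward model).
    scored = [(rule_based_reward(prompt, r), r) for r in responses]
    scored.sort(key=lambda x: x[0], reverse=True)

    best_reward, best_resp = scored[0]
    worst_reward, worst_resp = scored[-1]

    # Iterative DPO: use the best and worst responses as (chosen, rejected).
    dpo_pair = None
    if best_reward > worst_reward:  # skip prompts where all rewards tie
        dpo_pair = {"prompt": prompt, "chosen": best_resp, "rejected": worst_resp}

    # RAFT: keep only the positive (highest-reward) response for fine-tuning.
    raft_sample = None
    if best_reward > 0:
        raft_sample = {"prompt": prompt, "response": best_resp}

    return dpo_pair, raft_sample
```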

Experiment Settings

Rule-based Reward

We run all experiments without any neural reward model. Instead, following Deepseek-R1, we use a rule-based reward defined as follows: