Weihao Zeng$^$, Yuzhen Huang$^$, Wei Liu, Keqing He, Qian Liu, Zejun Ma, Junxian He$^*$
$^*$: Project lead
GitHub: https://github.com/hkust-nlp/simpleRL-reason
— Jan 25, 2025
<aside> 💡
This blog presents a replication of DeepSeek-R1-Zero and DeepSeek-R1 training on small models with limited data. Many of the experiments were developed and performed by us independently before DeepSeek-R1's release. We show that long Chain-of-Thought (CoT) and self-reflection can emerge on a 7B model with only 8K MATH examples, and we achieve surprisingly strong results on complex mathematical reasoning. Importantly, we fully open-source our training code and details to the community to inspire more work on reasoning.
Starting from Qwen2.5-Math-7B (the base model), we perform reinforcement learning on it directly with only 8K examples from the MATH dataset: no reward model, no SFT, just 8K MATH examples used for answer verification. The resulting model achieves 33.3% on AIME, 62.5% on AMC, and 77.2% on MATH (all pass@1 accuracy), outperforming Qwen2.5-Math-7B-Instruct and performing comparably to PRIME and rStar-MATH, which use more than 50x more data and more complicated components. We also try performing long CoT SFT with the same 8K examples before the RL stage and obtain even better performance.
</aside>
<aside> 👉
Many of our experiments were completed prior to the release of DeepSeek-R1. Interestingly, we independently converged on a similar, straightforward RL approach to that of DeepSeek-R1, finding it to be highly effective. The primary difference lies in our use of PPO instead of GRPO (a brief note on this distinction follows below). While this research is still ongoing, we believe it is valuable to share our intermediate findings with the community. We hope our work serves as a simple yet effective replication of DeepSeek-R1-Zero and DeepSeek-R1, tailored for smaller models and limited datasets.
</aside>
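For context on that difference: PPO maximizes the standard clipped surrogate objective below (written here in its textbook form; the exact advantage estimation and KL handling in our runs follow the released training code), while GRPO drops the learned value function and instead normalizes rewards within a group of responses sampled for the same query.

$$
\mathcal{L}^{\text{PPO}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}
$$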
Training dynamics of our Qwen2.5-SimpleRL-Zero training, starting from the Qwen2.5-Math-7B base model without SFT or reward models. The average benchmark accuracy and response length are computed over 8 complex math reasoning benchmarks. We observe a length decrease in the initial stage because the Qwen2.5-Math-7B base model tends to generate both language and code in its responses, resulting in lengthy outputs. This default pattern is quickly discouraged during RL, the model learns to output in a more appropriate format, and the length then starts to increase steadily. After just a few training steps, we also experienced the "aha moment" described in the DeepSeek-R1 paper: the emergence of self-reflection in the model's responses.
Many researchers are exploring possible paths toward learning o1-style models, such as distillation, MCTS, process-based reward models, and reinforcement learning. Recently, DeepSeek-R1 and Kimi k1.5 demonstrated an extremely simple recipe on this path: using simple RL algorithms to learn emerging long CoT and self-reflection patterns, leading to strong results with no MCTS or reward models involved. However, their experiments are based on huge models in a large-scale RL setting. It remains unknown whether small models can demonstrate similar behaviors, how much data is needed, and how the quantitative results would compare with other approaches. This blog reproduces the training of DeepSeek-R1-Zero and DeepSeek-R1 for complex mathematical reasoning, starting from Qwen2.5-Math-7B (the base model) and using only 8K (query, final answer) examples from the original MATH dataset for rule-based reward modeling in RL (a minimal sketch of this reward function is given at the end of this post). We are surprised by how far these 8K MATH examples lift the 7B base model without any other external signals:
All results are reported as pass@1 accuracy (see the note on this metric below the table).
| Model | AIME 2024 | MATH 500 | AMC | Minerva Math | OlympiadBench | Avg. |
|---|---|---|---|---|---|---|
| Qwen2.5-Math-7B-Base | 16.7 | 52.4 | 52.5 | 12.9 | 16.4 | 30.2 |
| Qwen2.5-Math-7B-Base + 8K MATH SFT | 3.3 | 54.6 | 22.5 | 32.7 | 19.6 | 26.5 |
| Qwen2.5-Math-7B-Instruct | 13.3 | 79.8 | 50.6 | 34.6 | 40.7 | 43.8 |
| Llama-3.1-70B-Instruct | 16.7 | 64.6 | 30.1 | 35.3 | 31.9 | 35.7 |
| rStar-Math-7B | 26.7 | 78.4 | 47.5 | - | 47.1 | - |
| Eurus-2-7B-PRIME | 26.7 | 79.2 | 57.8 | 38.6 | 42.1 | 48.9 |
| Qwen2.5-7B-SimpleRL-Zero | 33.3 | 77.2 | 62.5 | 33.5 | 37.6 | 48.8 |
| Qwen2.5-7B-SimpleRL | 26.7 | 82.4 | 62.5 | 39.7 | 43.3 | 50.9 |
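A brief note on the metric: pass@1 is the accuracy of a single attempt per problem. When it is estimated from $k$ sampled responses per problem rather than a single greedy decode, the standard estimator is simply the mean per-sample correctness averaged over the $N$ benchmark problems:

$$
\text{pass@1} = \frac{1}{N} \sum_{i=1}^{N} \frac{c_i}{k}, \qquad c_i = \#\{\text{correct responses among the } k \text{ samples for problem } i\}.
$$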
Qwen2.5-7B-SimpleRL-Zero is trained with simple RL directly from the base model, using only the 8K MATH examples. It achieves a gain of nearly 20 absolute points on average over the base model. Compared to Qwen2.5-Math-7B-Base fine-tuned (SFT) on the same 8K examples, RL generalizes far better, scoring more than 22 absolute points higher on average. Moreover, Qwen2.5-7B-SimpleRL-Zero outperforms Qwen2.5-Math-7B-Instruct on average and is roughly comparable to the recently released Eurus-2-7B-PRIME and rStar-Math-7B, which are also based on Qwen2.5-Math-7B. These baselines contain much more complicated components, such as reward models, and use at least 50x more data, much of it more advanced:
Data comparison of different approaches
| | Qwen2.5-Math-7B-Instruct | rStar-Math-7B | Eurus-2-7B-PRIME | Qwen2.5-7B-SimpleRL-Zero |
|---|---|---|---|---|
| Base Model | Qwen2.5-Math-7B | Qwen2.5-Math-7B | Qwen2.5-Math-7B | Qwen2.5-Math-7B |
| SFT Data | 2.5M (open-source and in-house) | ~7.3M (MATH, NuminaMath, etc.) | 230K | 0 |
| RM Data | 618K (in-house) | ~7K (in-house) | 0 | 0 |
| RM | Qwen2.5-Math-RM (72B) | None | Eurus-2-7B-SFT | None |
| RL Data | 66K queries × 32 samples | ~3.647M queries × 16 samples | 150K queries × 4 samples | 8K queries × 8 samples |
We are both excited and surprised by the significant gains achieved with only 8K MATH examples. Notably, while the MATH queries are considerably easier than challenging benchmarks such as AIME and AMC, this simple RL recipe demonstrates remarkable generalization, with performance increasing by at least 10 absolute points on every benchmark compared to the base model. This easy-to-hard generalization is something we do not observe with standard SFT training on the same dataset. We fully open-source our training code and details, hopefully providing a strong baseline setup for the community to further explore the potential of RL for reasoning.
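To make the rule-based reward mentioned above concrete, here is a minimal illustrative sketch of an outcome reward that only checks the final answer against the ground truth. The function names and the string-matching logic are ours for illustration; the released training code is the reference implementation, and real math verifiers additionally handle symbolic equivalence (e.g., 0.5 vs. \frac{1}{2}).

```python
import re
from typing import Optional

def extract_boxed_answer(response: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in the model response, if any."""
    idx = response.rfind(r"\boxed{")
    if idx == -1:
        return None
    depth, start = 0, idx + len(r"\boxed{")
    for i in range(start, len(response)):
        if response[i] == "{":
            depth += 1
        elif response[i] == "}":
            if depth == 0:
                return response[start:i]
            depth -= 1
    return None  # unbalanced braces: treat as unparsable

def rule_based_reward(response: str, gold_answer: str) -> float:
    """Binary outcome reward: 1.0 if the extracted final answer matches the gold answer, else 0.0."""
    pred = extract_boxed_answer(response)
    if pred is None:
        return 0.0  # no parsable final answer
    normalize = lambda s: re.sub(r"\s+", "", s)  # naive normalization; real verifiers check math equivalence
    return 1.0 if normalize(pred) == normalize(gold_answer) else 0.0
```

During training, each sampled response for a query is scored with this kind of scalar reward, which feeds directly into the RL update; no learned reward model is involved.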