NVIDIA's ProRL Unlocks Advanced Reasoning in AI Through Extended Reinforcement Learning
NVIDIA introduces ProRL, a novel reinforcement learning method that extends training duration to unlock new reasoning capabilities in AI models, achieving superior performance across multiple reasoning benchmarks.
Advancing Reasoning with Reinforcement Learning
Recent developments in reasoning-focused language models highlight the importance of scaling test-time computation. Reinforcement learning (RL) with verifiable rewards plays a pivotal role in enhancing reasoning while guarding against reward hacking, yet it remains debated whether RL genuinely extends a model's reasoning capabilities or merely improves the sampling efficiency of capabilities the base model already has.
Challenges in Current RL Research
Two major limitations hinder current research: a heavy reliance on specialized domains such as mathematics, which encourages overfitting and reduces exploration, and the premature termination of RL training, often after only a few hundred steps, before new reasoning capabilities can fully develop.
Introducing ProRL by NVIDIA
NVIDIA researchers have introduced ProRL, a method that enables significantly longer RL training (over 2,000 steps) and leverages diverse training data spanning mathematics, coding, science, logic puzzles, and instruction following. ProRL facilitates deeper exploration and discovery of novel reasoning strategies beyond the base models’ capabilities.
Nemotron-Research-Reasoning-Qwen-1.5B: A Breakthrough Model
Using ProRL, the team developed Nemotron-Research-Reasoning-Qwen-1.5B, which they present as the world's leading 1.5B-parameter reasoning model. It outperforms its base model DeepSeek-R1-1.5B and even surpasses the larger DeepSeek-R1-7B on multiple benchmarks, demonstrating that extended RL training can uncover new solution pathways previously absent in base models.
Diverse and Verifiable Training Dataset
The researchers compiled a robust dataset of 136,000 examples across five domains: mathematics, coding, STEM, logical puzzles, and instruction following. Training used the verl framework with an enhanced version of the GRPO (Group Relative Policy Optimization) method. Evaluation covered benchmarks including AIME 2024, AMC, Minerva Math, the PRIME validation set, HumanEval+, Reasoning Gym puzzles, GPQA Diamond, and IFEval.
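To make the training recipe more concrete, the sketch below shows the group-relative advantage computation at the heart of GRPO: each prompt gets a group of sampled responses, and each response's verifiable reward is normalized against the group's mean and standard deviation, removing the need for a learned value critic. This is a minimal illustration only; the group size, reward values, and any ProRL-specific enhancements to the loss are assumptions not detailed in this summary.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages in the GRPO style: each sampled response
    for a prompt is scored against the mean and std of its own rollout group."""
    r = np.asarray(rewards, dtype=np.float32)
    return (r - r.mean()) / (r.std() + eps)

# Illustrative example: 8 rollouts for one math prompt, scored by a binary
# verifiable reward (1.0 = final answer correct, 0.0 = incorrect).
rollout_rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
print(grpo_advantages(rollout_rewards))
```

These per-response advantages then weight a PPO-style policy update; the exact objective and regularization used for ProRL are described in the original paper.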
Impressive Performance Gains
Nemotron-Research-Reasoning-Qwen-1.5B achieved an average 15.7% improvement on math benchmarks and 14.4% on competitive programming pass@1 accuracy. STEM reasoning and instruction following improved by 25.9% and 22.0%, respectively. Logic puzzle rewards increased by 54.8%, with strong generalization to unseen tasks. Compared to the domain-specialized models DeepScaleR-1.5B and DeepCoder-1.5B, the ProRL-trained model showed superior pass@1 results on math (+4.6%) and code (+6.5%).
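The math and coding results above are reported as pass@1 accuracy. For readers unfamiliar with the metric, the sketch below shows the standard unbiased pass@k estimator (pass@1 is the k=1 case); the sample counts in the example are illustrative assumptions, not the evaluation settings used by the authors.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples
    drawn from n generated solutions, of which c are correct, passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative example: 16 samples per problem, 4 of them correct.
print(pass_at_k(n=16, c=4, k=1))  # estimated pass@1 = 0.25
```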
Redefining the Potential of RL in Reasoning
This research provides clear evidence that extended and stable RL training fosters novel reasoning patterns beyond initial model capabilities. ProRL helps models internalize abstract reasoning transferable beyond training data, challenging previous assumptions about RL’s limits and opening pathways for more advanced reasoning AI models.
For further details, see the original paper and model page.