ProRLv2: NVIDIA Extends Reinforcement Learning to Unlock Deeper LLM Reasoning
ProRLv2 scales RL training to 3,000 steps and combines regularization and exploration techniques to expand reasoning capabilities in compact LLMs, showing strong benchmark gains across math, coding, logic and STEM tasks.
ProRLv2 is NVIDIA's next step in applying prolonged reinforcement learning to large language models. By increasing RL training horizons from 2,000 to 3,000 steps and combining several stabilization and exploration techniques, ProRLv2 demonstrates that extended RL can meaningfully expand reasoning, creativity, and solution discovery even in relatively small models.
What ProRLv2 changes in training
ProRLv2 tests the hypothesis that longer RL training horizons let models explore solution spaces that are otherwise unreachable. Rather than stopping at a short RL schedule, it stretches the optimization window and pairs it with algorithmic safeguards that prevent instability and collapse during extended training.
Core innovations
NVIDIA combines multiple methods to enable long-horizon RL for LLMs:
- REINFORCE++ Baseline: A more robust variant of policy gradient optimization tailored to thousands of RL steps, reducing the instability that normally plagues extended RL for language models.
- KL Divergence Regularization & Reference Policy Reset: A KL penalty keeps the policy anchored to a reference model, and the pipeline periodically resets that reference to a recent checkpoint, which stabilizes learning and prevents the penalty from dominating the objective and stalling further progress.
- Decoupled Clipping & Dynamic Sampling (DAPO): Encourages discovery of diverse solutions by raising the upper clipping bound so unlikely tokens can be upweighted, and by concentrating updates on prompts of intermediate difficulty (a loss sketch follows this list).
- Scheduled Length Penalty: Applied cyclically to maintain output diversity and prevent entropy collapse as training progresses.
- Scaling RL Steps: The explicit jump from 2,000 to 3,000 RL steps is the central experimental variable, probing how much additional reasoning capacity extended RL can reveal.
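To make the interaction between decoupled clipping and KL regularization concrete, here is a minimal PyTorch-style sketch of a per-token surrogate loss combining the two ideas. The function name, clip ranges, and KL coefficient are illustrative placeholders, not NVIDIA's published implementation.
import torch

def decoupled_clip_kl_loss(logp_new, logp_old, logp_ref, advantages,
                           clip_low=0.2, clip_high=0.28, kl_coef=0.001):
    # Importance ratio between the current policy and the rollout policy.
    ratio = torch.exp(logp_new - logp_old)
    # Decoupled (asymmetric) clipping: a larger upper bound lets low-probability
    # tokens be upweighted more than symmetric PPO-style clipping would allow.
    clipped = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high)
    surrogate = -torch.min(ratio * advantages, clipped * advantages)
    # Simple per-token KL estimate against the (periodically reset) reference policy.
    kl_penalty = kl_coef * (logp_new - logp_ref)
    return (surrogate + kl_penalty).mean()

# Toy tensors standing in for per-token log-probabilities and advantages.
logp_old = torch.randn(6) - 2.0
logp_new = logp_old + 0.1 * torch.randn(6)
logp_ref = logp_old.clone()
advantages = torch.randn(6)
loss = decoupled_clip_kl_loss(logp_new, logp_old, logp_ref, advantages)

# Reference policy reset, conceptually: every so many steps, copy the current
# policy weights into the frozen reference model, e.g.
#   ref_model.load_state_dict(policy_model.state_dict())
In a real training loop the log-probabilities would come from the policy, rollout, and reference models respectively, and the reset interval would be a tuned hyperparameter.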
What extended RL achieves in reasoning
When applied to Nemotron-Research-Reasoning-Qwen-1.5B-v2, ProRLv2 produces substantial improvements across reasoning categories. The model trained for 3,000 RL steps shows:
- Notable pass@1 gains versus previous versions and competitor 1.5B models.
- Continued improvement as RL steps increase, especially on tasks where the base model struggled.
- Generalization to unseen tasks and emergence of novel solution strategies not explicitly present in training data.
Reported benchmark gains include average pass@1 improvements of 14.7% in math, 13.9% in coding, 54.8% in logic puzzles, 25.1% in STEM reasoning, and 18.1% in instruction-following tasks, with further improvements on harder, unseen benchmarks in v2.
Practical access: Nemotron-Research-Reasoning-Qwen-1.5B-v2
The latest checkpoint is available for testing on Hugging Face. Load the model with the standard Transformers API as shown below:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("nvidia/Nemotron-Research-Reasoning-Qwen-1.5B")
model = AutoModelForCausalLM.from_pretrained("nvidia/Nemotron-Research-Reasoning-Qwen-1.5B")
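From there, a quick generation call exercises the model. The prompt and sampling settings below are illustrative placeholders rather than recommended values:
# Illustrative prompt and sampling settings; adjust to your own use case.
prompt = "Solve step by step: what is the sum of the first 100 positive integers?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.6)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))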
Why this matters for model scaling and RL research
ProRLv2's main takeaway is that scaling RL itself is a lever for improving reasoning, not just model size or dataset scale. With careful regularization and exploration strategies, smaller architectures can learn deeper, more creative, and more generalizable reasoning behaviors. This reframes part of the research agenda: instead of only building ever larger models, investing in longer, stabilized RL schedules can yield comparable reasoning gains in compact models.