ProRLv2: NVIDIA Extends Reinforcement Learning to Unlock Deeper LLM Reasoning
ProRLv2 scales RL training to 3,000 steps and combines regularization and exploration techniques to expand reasoning capabilities in compact LLMs, showing strong benchmark gains across math, coding, logic and STEM tasks.
ProRLv2 is NVIDIA's next step in applying prolonged reinforcement learning to large language models. By increasing RL training horizons from 2,000 to 3,000 steps and combining several stabilization and exploration techniques, ProRLv2 demonstrates that extended RL can meaningfully expand reasoning, creativity, and solution discovery even in relatively small models.
What ProRLv2 changes in training
ProRLv2 tests the hypothesis that longer RL training horizons let models explore solution spaces that are otherwise unreachable. Rather than stopping at a short RL schedule, it stretches the optimization window and pairs it with algorithmic safeguards that prevent instability and collapse during extended training.
Core innovations
NVIDIA combines multiple methods to enable long-horizon RL for LLMs:
- REINFORCE++ Baseline: A more robust variant of policy gradient optimization tailored to thousands of RL steps, reducing the instability that normally plagues extended RL for language models.
- KL Divergence Regularization & Reference Policy Reset: A KL penalty keeps the policy anchored to a reference model, and the pipeline periodically resets that reference to a recent checkpoint, which stabilizes learning and prevents the penalty from dominating the objective and stalling further progress.
- Decoupled Clipping & Dynamic Sampling (DAPO): Encourages discovery of diverse solutions by raising the upper clipping bound so unlikely tokens can be upweighted, and by concentrating updates on prompts of intermediate difficulty (a loss sketch follows this list).
- Scheduled Length Penalty: Applied cyclically to maintain output diversity and prevent entropy collapse as training progresses.
- Scaling RL Steps: The explicit jump from 2,000 to 3,000 RL steps is the central experimental variable, probing how much additional reasoning capacity extended RL can reveal.
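To make the interaction between decoupled clipping and KL regularization concrete, here is a minimal PyTorch-style sketch of a per-token surrogate loss combining the two ideas. The function name, clip ranges, and KL coefficient are illustrative placeholders, not NVIDIA's published implementation.
import torch

def decoupled_clip_kl_loss(logp_new, logp_old, logp_ref, advantages,
                           clip_low=0.2, clip_high=0.28, kl_coef=0.001):
    # Importance ratio between the current policy and the rollout policy.
    ratio = torch.exp(logp_new - logp_old)
    # Decoupled (asymmetric) clipping: a larger upper bound lets low-probability
    # tokens be upweighted more than symmetric PPO-style clipping would allow.
    clipped = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high)
    surrogate = -torch.min(ratio * advantages, clipped * advantages)
    # Simple per-token KL estimate against the (periodically reset) reference policy.
    kl_penalty = kl_coef * (logp_new - logp_ref)
    return (surrogate + kl_penalty).mean()

# Toy tensors standing in for per-token log-probabilities and advantages.
logp_old = torch.randn(6) - 2.0
logp_new = logp_old + 0.1 * torch.randn(6)
logp_ref = logp_old.clone()
advantages = torch.randn(6)
loss = decoupled_clip_kl_loss(logp_new, logp_old, logp_ref, advantages)

# Reference policy reset, conceptually: every so many steps, copy the current
# policy weights into the frozen reference model, e.g.
#   ref_model.load_state_dict(policy_model.state_dict())
In a real training loop the log-probabilities would come from the policy, rollout, and reference models respectively, and the reset interval would be a tuned hyperparameter.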
What extended RL achieves in reasoning
When applied to Nemotron-Research-Reasoning-Qwen-1.5B-v2, ProRLv2 produces substantial improvements across reasoning categories. The model trained for 3,000 RL steps shows:
- Notable pass@1 gains versus previous versions and competitor 1.5B models.
- Continued improvement as RL steps increase, especially on tasks where the base model struggled.
- Generalization to unseen tasks and emergence of novel solution strategies not explicitly present in training data.
Reported benchmark gains include average pass@1 improvements of 14.7% in math, 13.9% in coding, 54.8% in logic puzzles, 25.1% in STEM reasoning, and 18.1% in instruction-following tasks, with further improvements on harder, unseen benchmarks in v2.
Practical access: Nemotron-Research-Reasoning-Qwen-1.5B-v2
The latest checkpoint is available for testing on Hugging Face. Load the model with the standard Transformers API as shown below:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("nvidia/Nemotron-Research-Reasoning-Qwen-1.5B")
model = AutoModelForCausalLM.from_pretrained("nvidia/Nemotron-Research-Reasoning-Qwen-1.5B")
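From there, a quick generation call exercises the model. The prompt and sampling settings below are illustrative placeholders rather than recommended values:
# Illustrative prompt and sampling settings; adjust to your own use case.
prompt = "Solve step by step: what is the sum of the first 100 positive integers?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.6)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))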
Why this matters for model scaling and RL research
ProRLv2's main takeaway is that scaling RL itself is a lever for improving reasoning, not just model size or dataset scale. With careful regularization and exploration strategies, smaller architectures can learn deeper, more creative, and more generalizable reasoning behaviors. This reframes part of the research agenda: instead of only building ever larger models, investing in longer, stabilized RL schedules can yield comparable reasoning gains in compact models.