rStar2-Agent: How a 14B Agentic RL Model Beats Bigger Models at Math
Why “thinking longer” is not enough
Large language models have improved at mathematical reasoning largely by expanding their Chain-of-Thought, essentially “thinking longer.” But longer internal reasoning often amplifies subtle mistakes instead of correcting them: when the initial approach is flawed, self-reflection inside the model frequently fails to detect and fix the error.
Microsoft’s rStar2-Agent takes a different path: it teaches the model to use external computational tools as part of its reasoning loop. Instead of relying solely on extending internal chains, the model writes and executes Python code, inspects the results, and iterates based on concrete feedback. The full paper is available at https://arxiv.org/abs/2508.20722.
Agentic reinforcement learning in practice
rStar2-Agent is a 14B-parameter model trained with an agentic reinforcement learning setup. During problem solving the model can produce code, run it in a Python environment, analyze execution outputs, and refine its approach. This creates an interactive reasoning process akin to how human mathematicians use computational tools to verify ideas and explore alternative solution paths.
This agentic approach turns reasoning into a sequence of propose-execute-analyze cycles. Instead of hoping that internal chain-of-thought alone finds the right path, the model gets immediate, actionable feedback from a real execution environment.
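A minimal sketch of such a propose-execute-analyze loop is shown below. It is an illustration of the general pattern, not the paper’s actual API: `run_python`, `model.generate`, and the `step.text` / `step.tool_call` / `step.is_final_answer` fields are all assumed names.

```python
import subprocess
import tempfile

def run_python(code: str, timeout: float = 5.0) -> str:
    """Execute a candidate code snippet in a subprocess and capture its output."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            ["python", path], capture_output=True, text=True, timeout=timeout
        )
        return result.stdout + result.stderr
    except subprocess.TimeoutExpired:
        return "TimeoutError: execution exceeded limit"

def solve(problem: str, model, max_turns: int = 8) -> str:
    """Propose-execute-analyze loop: the model alternates between reasoning
    text and tool calls until it emits a final answer (or runs out of turns)."""
    transcript = problem
    for _ in range(max_turns):
        step = model.generate(transcript)       # hypothetical generation interface
        transcript += step.text
        if step.is_final_answer:
            return step.text
        if step.tool_call is not None:          # the model proposed Python code
            feedback = run_python(step.tool_call)
            transcript += f"\n[tool output]\n{feedback}\n"
    return transcript
```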
Engineering infrastructure to scale tool calls
Training agentic models places heavy demands on infrastructure. A single training batch can produce tens of thousands of concurrent code execution requests, which, handled naively, would leave GPUs idle while they wait on tool calls.
The research team solved this with two main engineering innovations:
- A distributed code execution service that handles up to 45,000 concurrent tool calls with sub-second latency. The runtime isolates execution from the main training loop and balances load across many CPU workers.
- A dynamic rollout scheduler that assigns work based on real-time KV cache availability on each GPU rather than static allocation, preventing GPU idle time when skewed workloads make some reasoning traces far more expensive than others.
These improvements let the team complete the full training run in one week on 64 AMD MI300X GPUs, showing that frontier reasoning can come from smarter orchestration rather than pure scale.
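To make the execution-service idea concrete, here is a rough sketch of how tool calls can be decoupled from the training loop: requests are fanned out asynchronously to a pool of CPU workers so rollout generation never blocks on a single slow snippet. The worker count, the `execute` helper, and the lack of real sandboxing are all simplifying assumptions, not the paper’s actual service.

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

# Illustrative assumption: a pool of CPU workers kept separate from GPU training.
WORKERS = 64

def execute(code: str) -> str:
    """Placeholder for sandboxed execution of one tool call on a CPU worker."""
    try:
        local_vars: dict = {}
        exec(code, {}, local_vars)          # a real service would sandbox and time-limit this
        return str(local_vars.get("result", "ok"))
    except Exception as exc:                # surface errors as feedback, never crash the trainer
        return f"{type(exc).__name__}: {exc}"

async def dispatch(requests: list[str]) -> list[str]:
    """Fan a large batch of concurrent tool calls out to the worker pool."""
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor(max_workers=WORKERS) as pool:
        tasks = [loop.run_in_executor(pool, execute, code) for code in requests]
        return await asyncio.gather(*tasks)

# Usage (inside a script): results = asyncio.run(dispatch(batch_of_tool_calls))
```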
GRPO-RoC: prioritizing high-quality reasoning traces
The core algorithmic contribution is Group Relative Policy Optimization with Resampling on Correct (GRPO-RoC). Standard RL setups in this context risk rewarding correct final answers even when intermediate tool usage is buggy or inefficient.
GRPO-RoC counters this with an asymmetric sampling and filtering strategy:
- Oversample initial rollouts to build a large pool of reasoning traces
- Preserve diversity among failed attempts to learn varied error modes
- Filter positive examples to emphasize traces with minimal tool errors and clean formatting
This lets the model learn from high-quality successful traces while still exposing it to a variety of failure modes. The result is more efficient tool use and shorter, more focused reasoning traces.
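A simplified sketch of the selection step described above: failures are kept as-is to preserve diverse error modes, while correct rollouts are down-selected toward clean, low-error traces. The `Rollout` fields, the 50/50 split, and the sort key are illustrative assumptions rather than the paper’s exact procedure.

```python
import random
from dataclasses import dataclass

@dataclass
class Rollout:
    trace: str
    is_correct: bool        # final answer matches the reference
    tool_errors: int        # failed or retried code executions in the trace
    format_violations: int  # formatting issues in the response

def resample_on_correct(rollouts: list[Rollout], keep: int) -> list[Rollout]:
    """Asymmetric selection: keep failures for diversity, filter successes for quality."""
    positives = [r for r in rollouts if r.is_correct]
    negatives = [r for r in rollouts if not r.is_correct]

    # Prefer correct traces with the fewest tool errors and formatting issues.
    positives.sort(key=lambda r: (r.tool_errors, r.format_violations))
    selected_pos = positives[: keep // 2]

    # Keep a random, diverse subset of failures rather than the "best" ones.
    n_neg = min(len(negatives), keep - len(selected_pos))
    selected_neg = random.sample(negatives, n_neg)
    return selected_pos + selected_neg
```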
Curriculum-style training from concise to complex
Training proceeds in three stages to avoid early biases and to progressively teach efficient reasoning:
- Stage one: supervised fine-tuning on non-reasoning tasks focused on instruction following and tool formatting, with responses constrained to 8,000 tokens to force concise strategies. Performance jumps from near zero to over 70% on challenging benchmarks after this stage.
- Stage two: the token budget is extended to 12,000, allowing longer reasoning while preserving the efficiency habits learned in stage one.
- Stage three: training focuses on the hardest problems by filtering out items the model already masters, ensuring continued learning on challenging cases.
This progression—from concise responses to extended, difficult reasoning—maximizes learning efficiency while keeping compute overhead manageable.
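The staged schedule can be pictured as a small configuration list that a training loop walks through, tightening the token budget early and widening it later. The field names, `StageConfig`, and the `train_stage` callable are illustrative assumptions; only the token limits and stage order come from the description above.

```python
from dataclasses import dataclass

@dataclass
class StageConfig:
    name: str
    max_response_tokens: int
    data_filter: str   # which problems remain in the training pool

# Mirrors the three-stage progression described above; names are illustrative.
CURRICULUM = [
    StageConfig("stage1_concise",  8_000,  "all problems"),
    StageConfig("stage2_extended", 12_000, "all problems"),
    StageConfig("stage3_hardest",  12_000, "drop problems the model already solves reliably"),
]

def run_curriculum(train_stage, model, dataset):
    """Walk the curriculum, carrying the model forward between stages."""
    for cfg in CURRICULUM:
        model = train_stage(model, dataset, cfg)   # hypothetical per-stage trainer
    return model
```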
Breakthrough results and efficiency gains
rStar2-Agent-14B achieves 80.6% accuracy on AIME24 and 69.8% on AIME25, outperforming much larger models such as the 671B DeepSeek-R1. Importantly, it reaches these scores with much shorter reasoning traces: about 10,000 tokens on average versus over 17,000 for comparable models.
Beyond math, the model generalizes well: despite being trained mainly on math problems, it surpasses specialized models on scientific reasoning benchmarks and remains competitive on general alignment tasks.
Mechanisms revealed by analysis
Examining the trained model’s reasoning traces reveals two types of high-entropy tokens. Traditional forking tokens trigger internal exploration and self-reflection. A newer category, called reflection tokens, appears specifically when the model reacts to tool feedback.
Reflection tokens correspond to environment-driven reasoning steps: the model inspects execution results, diagnoses issues, and adjusts subsequent actions. This behavior produces more robust problem solving than pure chain-of-thought alone.
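One way to surface such tokens in an analysis of this kind is to compute the per-token entropy of the model’s next-token distribution and flag high-entropy positions that immediately follow tool output. The sketch below is a generic illustration under that assumption; the threshold and helper names are not from the paper.

```python
import torch
import torch.nn.functional as F

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Per-position entropy of the next-token distribution.
    logits: [seq_len, vocab_size] -> entropy: [seq_len]"""
    log_probs = F.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)

def flag_reflection_candidates(logits: torch.Tensor,
                               after_tool_feedback: torch.Tensor,
                               threshold: float = 2.0) -> torch.Tensor:
    """Boolean mask of high-entropy positions that directly follow tool output;
    these are candidates for the 'reflection token' behavior described above."""
    entropy = token_entropy(logits)
    return (entropy > threshold) & after_tool_feedback  # both shaped [seq_len]
```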
Implications for future AI systems
rStar2-Agent shows that moderate-sized models can reach frontier reasoning by coupling smart training algorithms, tool integration, and efficient infrastructure. The agentic approach points toward AI systems that blend text reasoning with active use of external tools and environments, enabling dynamic, interactive problem solving rather than static text generation.
For more details, see the paper at https://arxiv.org/abs/2508.20722 and the project’s GitHub resources for tutorials and notebooks.