R-Zero: Self‑Evolving AI That Generates Its Own Training Data
R-Zero is a co-evolutionary framework where a Challenger creates hard problems and a Solver learns from them, enabling zero-data training that boosts reasoning accuracy across benchmarks.
Large Language Models have transformed language understanding, reasoning, and code generation, but pushing their reasoning beyond human-level performance remains constrained by the need for massive human-annotated datasets. R-Zero proposes a different route: a fully autonomous co-evolutionary framework that generates a curriculum of training problems from scratch, enabling models to self-improve without external labels.
The data bottleneck in reasoning LLMs
Most advances in LLM reasoning rely on datasets curated and labeled by humans. Creating these resources is costly and inherently constrained by human knowledge and priorities. Even approaches that avoid explicit labels often depend on existing collections of tasks to derive reward or training signals, which limits scalability and the potential to exceed human expertise.
How R-Zero works: Challenger and Solver
R-Zero introduces a two-agent setup in which both agents are initialized from the same base model: the Challenger and the Solver. They co-evolve in an iterative loop:
- Challenger: trained with reinforcement learning to generate novel and challenging reasoning problems that sit near the Solver's current capability frontier.
- Solver: fine-tuned on the problems produced by the Challenger, using pseudo-labels derived from majority voting among its own answers.
The cycle alternates between improving the Challenger's ability to create informative tasks and improving the Solver's reasoning via those tasks. This produces a continuously adapting curriculum tailored to the model's weaknesses and strengths.
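The control flow of this loop can be sketched in a few lines of Python. This is only an illustration of the alternation described above: the callables passed in (train_challenger, generate_problems, label_and_filter, train_solver) are hypothetical interfaces, not part of the R-Zero codebase.

```python
def r_zero(base_model, train_challenger, generate_problems,
           label_and_filter, train_solver, iterations=3):
    """Sketch of the R-Zero co-evolution loop (hypothetical interfaces).

    The four callables supplied by the caller do the actual work; this
    function only captures the alternation between Challenger updates and
    Solver updates described in the text.
    """
    challenger, solver = base_model, base_model
    for _ in range(iterations):
        # RL-train the Challenger to pose problems near the Solver's
        # current capability frontier.
        challenger = train_challenger(challenger, solver)
        # Sample a fresh batch of problems from the updated Challenger.
        problems = generate_problems(challenger)
        # Pseudo-label by majority vote and keep only reliably labeled,
        # sufficiently informative questions (see the next section).
        dataset = label_and_filter(solver, problems)
        # Fine-tune the Solver on its self-generated curriculum.
        solver = train_solver(solver, dataset)
    return solver
```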
Key technical innovations
- Group Relative Policy Optimization (GRPO): a reinforcement learning method that normalizes the reward of each generated answer relative to the group of responses sampled for the same prompt, allowing efficient fine-tuning of policy LLMs without an explicit value function (see the advantage-normalization sketch after this list).
- Uncertainty-driven curriculum: the Challenger is rewarded for producing problems that are neither trivial nor impossible. The reward peaks where Solver accuracy is around 50 percent, which maximizes the learning signal according to the framework's analysis (an illustrative reward shape is sketched after this list).
- Repetition penalty and format checks: to ensure diversity and structure, generated batches are penalized for repetition and vetted with strict format checks (the reward sketch below includes a simple repetition term).
- Pseudo-label quality control: only question-answer pairs with intermediate answer consistency are used for training, filtering out ambiguous or ill-posed problems and improving label reliability (a filtering sketch follows the list).
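To make the GRPO idea concrete, the snippet below computes group-relative advantages for one prompt from its per-response rewards. It is a minimal sketch of the normalization step only, not the full GRPO objective; the function name and the small epsilon are illustrative choices.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantage estimate for one prompt (sketch).

    `rewards` holds the scalar reward of each of the G sampled responses.
    GRPO normalizes each reward against the group mean and standard
    deviation, so no learned value function is needed.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    baseline = rewards.mean()
    scale = rewards.std() + 1e-8          # avoid division by zero
    return (rewards - baseline) / scale

# Example: 4 sampled answers to one generated problem, binary correctness reward.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))   # ≈ [ 1., -1., -1.,  1.]
```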
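The uncertainty-driven reward and the repetition penalty can be combined into a single scalar, as sketched below. The exact shape and the rep_weight coefficient are illustrative assumptions consistent with the description above, not necessarily the paper's formulation.

```python
def challenger_reward(solver_accuracy, repetition_fraction, rep_weight=0.5):
    """Illustrative Challenger reward (not the paper's exact formula).

    The uncertainty term peaks when the Solver answers the generated problem
    correctly about half the time and falls toward 0 for problems that are
    trivial (accuracy near 1) or impossible (accuracy near 0). A penalty is
    subtracted for near-duplicate problems within the generated batch.
    """
    uncertainty_reward = 1.0 - 2.0 * abs(solver_accuracy - 0.5)
    return uncertainty_reward - rep_weight * repetition_fraction

print(challenger_reward(0.5, 0.0))   # 1.0   -> maximally informative problem
print(challenger_reward(0.9, 0.0))   # ≈ 0.2 -> too easy
print(challenger_reward(0.5, 0.4))   # 0.8   -> informative but repetitive
```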
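Pseudo-label quality control can likewise be illustrated with majority voting plus a consistency band. The thresholds below (0.3 and 0.8) are placeholders for illustration; the framework's actual cutoffs may differ.

```python
from collections import Counter

def pseudo_label(answers, low=0.3, high=0.8):
    """Majority-vote pseudo-labeling with a consistency filter (sketch).

    `answers` are the Solver's sampled answers to one generated question.
    The majority answer becomes the pseudo-label, but the pair is kept for
    training only when agreement is intermediate: near-unanimous questions
    teach little, and near-random agreement suggests an ambiguous or
    ill-posed problem.
    """
    counts = Counter(answers)
    label, votes = counts.most_common(1)[0]
    consistency = votes / len(answers)
    keep = low <= consistency <= high
    return label, consistency, keep

# Example: 10 Solver samples for one question.
print(pseudo_label(["42"] * 6 + ["41"] * 3 + ["40"]))   # ('42', 0.6, True)
print(pseudo_label(["7"] * 10))                          # ('7', 1.0, False)
```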
Empirical performance
R-Zero was evaluated on multiple benchmarks. On mathematical reasoning tasks, including AMC, Minerva, MATH-500, GSM8K, OlympiadBench, and AIME, three iterations of R-Zero produced notable accuracy gains across model sizes and architectures. For example, Qwen3-8B-Base improved from an average score of 49.18 to 54.69 after three iterations.
Improvements also generalized beyond math. On general reasoning benchmarks such as MMLU-Pro, SuperGPQA, and BIG-Bench Extra Hard (BBEH), R-Zero showed significant transfer effects. Qwen3-8B-Base's overall average rose from 34.49 to 38.73, demonstrating that the self-generated curriculum benefits broader reasoning abilities.
Why this matters
R-Zero showcases a path toward scalable, data-free training for reasoning-centric models. By removing dependence on human-labeled datasets and enabling autonomous curriculum generation, it opens opportunities for models to explore and extend reasoning capabilities beyond the current limits of curated data. Researchers and practitioners can experiment with the framework via the paper and associated GitHub resources to build on these ideas.