
Agent0: Self-Evolving LLMs That Learn Tools and Solve Math Without External Data

Agent0 co-evolves a curriculum agent and an executor from the same base LLM, using sandboxed Python tool calls and ambiguity-aware RL to improve math and general reasoning without external data.

Two agents cloned from a single base model

Agent0 begins with a single base policy, for example Qwen3 4B Base or Qwen3 8B Base, and clones it into two distinct agents: a Curriculum Agent that invents tasks and an Executor Agent that attempts to solve them using a sandboxed Python tool. Training alternates between evolving the curriculum and improving the executor, creating a closed feedback loop where each agent pushes the other to higher capability.
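
As a rough sketch, the loop reduces to the following structure, with the three stages passed in as callables; the names and signatures here are illustrative, not the authors' API.

    # Minimal sketch of the co-evolution loop. The stage functions are injected
    # as callables; names and signatures are illustrative, not the authors' API.
    import copy
    from typing import Callable

    def co_evolve(base_policy,
                  evolve_curriculum: Callable,  # trains the curriculum agent, returns a task pool
                  select_frontier: Callable,    # keeps tasks in the informative self-consistency band
                  evolve_executor: Callable,    # trains the executor on frontier tasks (ADPO)
                  iterations: int = 3):
        """Clone one base policy into two agents and alternate their training."""
        curriculum_agent = copy.deepcopy(base_policy)
        executor_agent = copy.deepcopy(base_policy)
        for _ in range(iterations):
            # Stage 1: evolve the curriculum against the current (frozen) executor.
            task_pool = evolve_curriculum(curriculum_agent, executor_agent)
            # Stage 2: freeze the curriculum, keep frontier tasks, train the executor.
            frontier = select_frontier(task_pool, executor_agent)
            evolve_executor(executor_agent, frontier)
        return curriculum_agent, executor_agent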

Curriculum evolution and task scoring

During the curriculum stage, the Curriculum Agent generates a batch of tasks. For every task, the Executor samples multiple responses, and a composite reward scores the task using three signals: uncertainty, tool usage, and repetition avoidance.

  • Uncertainty reward encourages tasks where the executor's self-consistency is near 0.5, i.e., tasks that are challenging but not impossible for the current solver. Self-consistency p̂(x) is computed by majority voting across k sampled responses.
  • Tool use reward favors tasks that trigger the Python interpreter. The trajectory's number of tool calls contributes to the reward up to a cap (C=4 in experiments), steering the curriculum toward tool-dependent problems.
  • Repetition penalty discourages duplicate tasks within a batch by clustering with a BLEU-based distance and penalizing large clusters.

A format check gates the composite reward, and the Curriculum Agent is updated with Group Relative Policy Optimization (GRPO) using this composite value.
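
A minimal sketch of how the three signals could combine into the task-level reward, assuming simple functional forms for each signal; the paper's exact weights and formulas may differ, and the helper names are illustrative.

    # Illustrative composite curriculum reward. Only the structure follows the
    # description above; exact functional forms and weights may differ.
    from collections import Counter

    def self_consistency(answers):
        """Fraction of sampled executor answers that agree with the majority vote."""
        return Counter(answers).most_common(1)[0][1] / len(answers)

    def uncertainty_reward(p_hat):
        """Peaks when self-consistency is near 0.5: challenging but not impossible."""
        return 1.0 - 2.0 * abs(p_hat - 0.5)

    def tool_use_reward(num_tool_calls, cap=4):
        """Rewards Python interpreter calls, saturating at the cap (C=4 in experiments)."""
        return min(num_tool_calls, cap) / cap

    def composite_task_reward(answers, num_tool_calls, repetition_penalty, format_ok):
        """Format check gates the reward; otherwise the three signals are combined."""
        if not format_ok:
            return 0.0
        p_hat = self_consistency(answers)
        return uncertainty_reward(p_hat) + tool_use_reward(num_tool_calls) - repetition_penalty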

Executor evolution with noisy self labels

Once the Curriculum Agent is trained and frozen, it generates a large pool of candidate tasks. Agent0 filters this pool to keep tasks near the Executor's capability frontier, defined by self-consistency p̂(x) falling within an informative band (for example 0.3–0.8). This produces a frontier dataset of problems that are neither trivial nor impossible.
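
In code, the frontier filter amounts to a band check on p̂(x); the 0.3–0.8 thresholds below are just the example values quoted above.

    # Frontier filter: keep tasks whose self-consistency lies in the informative band.
    def filter_frontier(tasks_with_p_hat, low=0.3, high=0.8):
        """tasks_with_p_hat: iterable of (task, p_hat) pairs."""
        return [task for task, p_hat in tasks_with_p_hat if low <= p_hat <= high]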

For frontier tasks the Executor performs multi-turn tool-integrated rollouts that can interleave natural language reasoning, Python code blocks, and output feedback from the sandboxed interpreter. Generation pauses on tool calls, executes code via a VeRL Tool-backed interpreter, and resumes conditioned on the results. Final answers are emitted inside boxed tags, and majority voting across sampled trajectories defines pseudo labels and terminal rewards.
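
The control flow might look roughly like the sketch below. It assumes markdown-style Python code fences, \boxed{} answers, and injected callables for decoding and sandboxed execution; it is not the authors' implementation, only the loop described above.

    # Sketch of a tool-integrated rollout. Decoding and sandboxed execution are
    # injected as callables (generate_segment, run_python); the tag and fence
    # conventions are assumptions for illustration.
    import re
    from collections import Counter

    def rollout(generate_segment, run_python, prompt, max_turns=8):
        """Interleave text reasoning, Python tool calls, and interpreter feedback."""
        context = prompt
        for _ in range(max_turns):
            segment = generate_segment(context)          # decode until a tool call or a final answer
            context += segment
            answer = re.search(r"\\boxed\{(.*?)\}", segment)
            if answer:                                   # final answer emitted in boxed tags
                return answer.group(1)
            code = re.search(r"```python\n(.*?)```", segment, re.DOTALL)
            if code:                                     # pause, run in the sandbox, resume with output
                context += "\n[output]\n" + run_python(code.group(1)) + "\n[/output]\n"
        return None

    def pseudo_label(generate_segment, run_python, prompt, k=8):
        """Majority vote over k sampled trajectories defines the pseudo label."""
        answers = [rollout(generate_segment, run_python, prompt) for _ in range(k)]
        answers = [a for a in answers if a is not None]
        return Counter(answers).most_common(1)[0][0] if answers else None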

Training the Executor uses a modified RL objective called Ambiguity Dynamic Policy Optimization (ADPO). ADPO adapts GRPO to noisy labels by scaling the normalized advantage with self-consistency and by setting a dynamic upper clipping bound for importance sampling that depends on p̂. These modifications down-weight highly ambiguous examples and relax clipping adaptively to aid exploration on uncertain tasks.
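
A simplified PyTorch-style sketch of these two modifications on top of a GRPO clipped objective is shown below; the exact scaling and clipping schedules are placeholders rather than the paper's formulas, and the variable names are illustrative.

    # Simplified ADPO-style loss on top of a GRPO clipped objective. The advantage
    # scaling and the dynamic upper clip are placeholder schedules, not a faithful
    # reproduction of the paper.
    import torch

    def adpo_loss(logp_new, logp_old, advantages, p_hat, eps_low=0.2, eps_high_max=0.4):
        """
        logp_new, logp_old: per-token log-probs under the current / rollout policy.
        advantages: group-normalized advantages (GRPO-style).
        p_hat: task self-consistency in [0, 1]; low values mean high ambiguity.
        """
        # 1) Down-weight ambiguous examples by scaling the advantage with p_hat.
        scaled_adv = p_hat * advantages

        # 2) Relax the upper clipping bound as ambiguity grows, allowing more
        #    exploration on uncertain tasks.
        eps_high = eps_low + (eps_high_max - eps_low) * (1.0 - p_hat)

        ratio = torch.exp(logp_new - logp_old)
        clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
        return -torch.min(ratio * scaled_adv, clipped * scaled_adv).mean()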

Empirical results across math and general reasoning

Agent0 is implemented on VeRL and evaluated on Qwen3 4B Base and Qwen3 8B Base, using a sandboxed Python interpreter as the single external tool. The evaluation covers ten benchmarks: seven mathematical reasoning tasks (AMC, Minerva, MATH, GSM8K, Olympiad Bench, AIME24, AIME25) and three general reasoning tasks (SuperGPQA, MMLU Pro, BBEH).

On Qwen3 8B Base, Agent0 raises the math average from 49.2 to 58.2 and the general reasoning average from 34.5 to 42.1. These improvements hold across three co-evolution iterations, where math performance rises from 55.1 to 58.2 and general reasoning steadily improves, indicating stable self-improvement rather than collapse.

Agent0 also outperforms prior data-free or zero-data baselines such as R-Zero, Absolute Zero, SPIRAL, and Socratic Zero, both with and without external tools or APIs. For Qwen3 8B it exceeds R-Zero by 6.4 percentage points and Absolute Zero by 10.6 points on the overall average.

Qualitative examples show the curriculum evolving from simple geometry problems to complex constraint satisfaction questions, while executor trajectories mix textual reasoning and Python calls to reach correct answers.

Why this matters

Agent0 demonstrates that a single base LLM can bootstrap both a task generator and a solver and improve itself without any external labeled data. The combination of a frontier curriculum based on self-consistency, explicit encouragement of tool use, and ADPO for stable training on pseudo labels enables consistent gains on challenging benchmarks. This makes a strong case for co-evolutionary, tool-integrated training as a practical path toward autonomous model improvement.

References and resources

Paper: https://arxiv.org/pdf/2511.16043

Implementation notes: Agent0 builds on VeRL and uses a sandboxed Python interpreter for tool-integrated rollouts. The authors provide further code and materials in their repository and project pages.
