Tsinghua University's 'Absolute Zero': Training AI Models Without External Data
Tsinghua University researchers developed the Absolute Zero paradigm, which trains large language models without external data: the model self-evolves by generating and solving its own tasks, with a code executor grounding the reward signal, to enhance AI reasoning and learning.
Advancing Reasoning in Large Language Models
Large Language Models (LLMs) have improved their reasoning through Reinforcement Learning with Verifiable Rewards (RLVR), which rewards verified outcomes rather than imitation of reasoning steps. However, current RLVR methods face scalability issues because they depend on manually curated datasets of questions and answers. As reasoning models advance, producing vast, high-quality datasets becomes unsustainable and may limit how autonomously these systems can learn.
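To make outcome-based feedback concrete, here is a minimal sketch (our illustration, not code from any cited paper) of a verifiable reward: the reward depends only on whether the model's final answer matches a known-correct result, not on the reasoning trace. The "Answer:" extraction convention below is an assumption made for illustration.

```python
import re

def verifiable_reward(model_output: str, reference_answer: str) -> float:
    """Binary outcome-based reward: 1.0 if the model's final answer matches
    the reference, else 0.0. The reasoning trace itself is ignored."""
    # Illustrative convention: the answer is reported as "Answer: <value>".
    match = re.search(r"Answer:\s*(.+)", model_output)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

# Only the verified outcome matters, not how it was reached.
print(verifiable_reward("Reasoning steps... Answer: 42", "42"))  # 1.0
print(verifiable_reward("Reasoning steps... Answer: 41", "42"))  # 0.0
```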
Evolution of Self-Improving Models
Various approaches have been developed to improve LLM reasoning. The STaR method introduced self-bootstrapping through expert iteration on outcome-verified responses. The o1 model scaled this idea to achieve top results, followed by R1, which applied RL directly to base LLMs. Self-play paradigms have also progressed from early two-agent systems such as Schmidhuber's to advanced implementations like AlphaGo and AlphaZero. More recent methods, including SPIN, Self-Rewarding Language Models, SPC, and SPAG, use self-play to improve alignment and reasoning.
Introducing Absolute Zero and AZR
Researchers from Tsinghua University, the Beijing Institute for General Artificial Intelligence, and Penn State proposed the Absolute Zero paradigm, which lets a single model autonomously generate and solve the tasks that maximize its own learning, without any external data. The Absolute Zero Reasoner (AZR) self-evolves its training curriculum and reasoning ability through a code executor that validates proposed code reasoning tasks and verifies answers, providing a consistent, grounded reward signal. AZR works effectively across model sizes and types.
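The verification role of the executor can be pictured roughly as follows. This is our own simplified sketch rather than the released AZR code, and the assumption that each proposed task defines a function f applied to a single input is purely illustrative (in practice, execution would also be sandboxed and time-limited).

```python
def execute(program: str, inp):
    """Run a proposed program (expected to define a function f) on the given
    input; return its output, or None if execution fails."""
    scope = {}
    try:
        exec(program, scope)        # in a real system this runs in a sandbox
        return scope["f"](inp)
    except Exception:
        return None

def validate_task(program: str, inp):
    """Keep a proposed task only if it runs and is deterministic: two
    executions must yield the same output, which becomes the gold answer."""
    first, second = execute(program, inp), execute(program, inp)
    if first is None or first != second:
        return None
    return first

def verify_solution(predicted_output, gold_output) -> float:
    """Grounded reward: 1.0 if the solver's prediction matches the
    executor-derived output, else 0.0."""
    return 1.0 if predicted_output == gold_output else 0.0

# Example: a model-proposed task and the executor-derived gold answer.
task_program = "def f(x):\n    return sorted(x)[::-1]"
gold = validate_task(task_program, [3, 1, 2])   # -> [3, 2, 1]
print(verify_solution([3, 2, 1], gold))         # 1.0
```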
AZR’s Training Mechanism
During multitask training, AZR proposes new reasoning tasks conditioned on previously generated examples, attempts to solve them, and receives feedback from a code executor that constructs, executes, and validates the code tasks. The full algorithm covers buffer initialization, task proposal input and buffer management, task construction, solution validation, and advantage estimation via Task-Relative REINFORCE++; a simplified sketch of one training step is shown below.
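The loop below is a heavily simplified sketch of one such step, not the paper's implementation. The names model.propose_task, model.solve_task, and model.update are placeholders for LLM sampling and the policy update, executor reuses validate_task / verify_solution from the sketch above, and the proposer reward shown is a crude stand-in for the learnability-based reward and the per-role baselines of Task-Relative REINFORCE++.

```python
import random

def azr_training_step(model, executor, buffer, batch_size=4):
    """One illustrative Absolute Zero-style step: the same model proposes
    tasks conditioned on past examples, solves them, and is rewarded using
    only the code executor's verdicts."""
    proposer_rewards, solver_rewards = [], []

    for _ in range(batch_size):
        # 1. Propose: condition on a few buffered examples to generate a new task.
        references = random.sample(buffer, k=min(3, len(buffer)))
        program, test_input = model.propose_task(references)

        # 2. Construct/validate: the executor runs the program to obtain the
        #    gold output; invalid or non-deterministic tasks are discarded.
        gold_output = executor.validate_task(program, test_input)
        if gold_output is None:
            proposer_rewards.append(0.0)
            continue
        buffer.append((program, test_input, gold_output))

        # 3. Solve: the same model attempts the task it just created.
        prediction = model.solve_task(program, test_input)

        # 4. Verify: grounded binary reward from the executor's comparison.
        solved = executor.verify_solution(prediction, gold_output)
        solver_rewards.append(solved)
        # Crude stand-in for the proposer's learnability reward, which in the
        # paper favors tasks that are neither trivial nor unsolvable.
        proposer_rewards.append(1.0 - solved)

    # 5. Update: the paper estimates advantages with separate baselines per
    #    task type and role (Task-Relative REINFORCE++); abstracted away here.
    model.update(proposer_rewards, solver_rewards)
```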
Performance and Scalability
The Absolute Zero Reasoner-Coder-7B model achieved state-of-the-art results on overall and coding benchmarks, surpassing previous best models by 1.8 percentage points despite no exposure to external data. It also outperformed human-data-trained models on coding by 0.3 points. The larger models (7B and 14B) continued to improve beyond 200 training steps, and out-of-distribution gains grew with model size (+5.7%, +10.2%, and +13.2% for 3B, 7B, and 14B respectively).
Safety Considerations and Future Directions
While Absolute Zero reduces the need for human-curated datasets, safety remains a challenge. The team observed "uh-oh moments", instances of concerning chain-of-thought reasoning in the Llama-3.1-8B variant. Continuous oversight is necessary to manage these risks, marking an important area for future research.
For detailed information, check the Paper, Model on Hugging Face, and GitHub Page. Follow updates on Twitter.