
ASTRO Boosts Llama 3 Reasoning by Over 16% Using Post-Training Techniques

ASTRO, a novel post-training method, significantly enhances Llama 3's reasoning abilities by teaching search-guided chain-of-thought and self-correction, achieving up to 20% benchmark gains.

Enhancing Llama 3's Reasoning Without Architectural Changes

Improving the reasoning skills of large language models (LLMs) without modifying their architecture is a significant challenge. Researchers from Meta AI and the University of Washington introduced ASTRO (Autoregressive Search-Taught Reasoner), a post-training framework that enhances reasoning capabilities in Llama-3.1-70B-Instruct.

How ASTRO Works: Search-Guided Chain-of-Thought

ASTRO teaches Llama 3 to perform in-context search, self-reflection, and backtracking, techniques inspired by human problem-solving and symbolic search algorithms. The method begins by running Monte Carlo Tree Search (MCTS) over math problem-solving trajectories, exploring both correct and incorrect reasoning paths. The resulting search trees are linearized into long chains of thought (CoT) that include failures and recoveries, then rewritten in natural language to serve as supervised fine-tuning data.

This enables the model not only to proceed step by step but also to reevaluate its work and backtrack when it detects a potential mistake. For example, the model might say, “Let’s go back to where we set up the equation,” signaling self-correction in the middle of its reasoning.
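
To make the linearization step concrete, the toy sketch below flattens a small search tree into a single chain of thought and inserts an explicit backtracking phrase after each failed branch; the Node structure, the linearize helper, and the exact wording are illustrative assumptions, not the paper’s actual code or data format.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Node:
        """One reasoning step in a toy search tree."""
        text: str
        correct: bool = True
        children: List["Node"] = field(default_factory=list)

    def linearize(node: Node, steps: List[str]) -> None:
        """Depth-first walk that keeps failed branches and adds an explicit
        backtracking phrase before returning to the parent step."""
        steps.append(node.text)
        for child in node.children:
            linearize(child, steps)
            if not child.correct:
                steps.append("That does not work. Let's go back to where we "
                             + node.text[0].lower() + node.text[1:])

    # Toy trace: set up an equation, try a wrong value, backtrack, then solve.
    root = Node("Set up the equation 2x + 3 = 11.", children=[
        Node("Try x = 3: 2*3 + 3 = 9, which is not 11.", correct=False),
        Node("Subtract 3 from both sides: 2x = 8, so x = 4.", correct=True),
    ])

    steps: List[str] = []
    linearize(root, steps)
    print("\n".join(steps))

Training on traces like this exposes the model to failure-and-recovery patterns rather than only clean, always-correct solutions.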

Supervised Fine-Tuning with Search Priors

ASTRO fine-tunes Llama-3.1-70B-Instruct on 36,100 curated CoT solutions drawn from datasets such as MATH, AMC/AIME, and AoPS-style problems. Supervised fine-tuning alone yields the following benchmark scores:

  • MATH 500: 69.6%
  • AMC 2023: 51.9%
  • AIME 2024: 16.3%

These results surpass baseline models and other variants trained without explicit search priors, showing that even supervised fine-tuning alone can boost reasoning performance by exposing the model to search-structured data.
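
Below is a minimal sketch of what supervised fine-tuning on such search-derived CoT data could look like; the example record, sequence length, and learning rate are placeholder assumptions rather than the paper’s actual configuration, and a real run at this scale would use distributed training.

    import torch
    from torch.optim import AdamW
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Placeholder setup: a 70B model needs multi-GPU sharding in practice.
    model_name = "meta-llama/Llama-3.1-70B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
    optimizer = AdamW(model.parameters(), lr=1e-5)  # assumed learning rate

    # Each record pairs a problem with a search-linearized solution that may
    # contain backtracking phrases (illustrative example, not real training data).
    examples = [
        {"prompt": "Solve 2x + 3 = 11.",
         "solution": "Set up the equation... Let's go back to where we set up "
                     "the equation... Subtract 3 from both sides, so x = 4."},
    ]

    model.train()
    for ex in examples:
        text = ex["prompt"] + "\n" + ex["solution"] + tokenizer.eos_token
        batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
        # Standard causal-LM objective: the labels are the input ids themselves.
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

The point of the sketch is the data format: the model is trained with an ordinary next-token objective, and the search behavior comes entirely from the structure of the solutions it sees.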

Reinforcement Learning with Search-Aware Initialization

Following supervised fine-tuning, ASTRO applies reinforcement learning (RL) initialized from the SFT checkpoint. It uses a modified Group Relative Policy Optimization (GRPO) objective with verifiable reward signals (+1 for a correct final answer, -1 for an incorrect one) on 8,700 moderately difficult prompts. During RL, the model’s chain-of-thought length grows substantially, indicating deeper exploration.
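
At the heart of a GRPO-style update is a group-relative advantage: each sampled solution’s reward is normalized against the other samples drawn for the same prompt. The sketch below shows that computation with the +1/-1 verifiable reward described above; the answer-matching check and group size are simplified assumptions.

    import statistics
    from typing import List

    def verifiable_reward(answer: str, reference: str) -> float:
        """+1 if the final answer matches the reference, -1 otherwise (simplified check)."""
        return 1.0 if answer.strip() == reference.strip() else -1.0

    def group_relative_advantages(rewards: List[float]) -> List[float]:
        """Normalize each sample's reward against the mean and standard
        deviation of its own group of completions."""
        mean = statistics.mean(rewards)
        std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
        return [(r - mean) / std for r in rewards]

    # Example: four sampled solutions to one prompt, two correct and two incorrect.
    rewards = [verifiable_reward(a, "4") for a in ["4", "3", "4", "5"]]
    print(group_relative_advantages(rewards))  # correct samples receive positive advantage

In GRPO these advantages weight the policy-gradient update on the sampled tokens, which avoids training a separate value model.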

Performance after RL training:

  • MATH 500: 81.8%
  • AMC 2023: 64.4%
  • AIME 2024: 30.0%

These scores rival or exceed those of larger models, confirming the effectiveness of ASTRO's search-aware training.

The Role of Backtracking in Reasoning Success

A key observation is a strong positive correlation (Pearson coefficient above 0.8) between how often the model backtracks or self-corrects and its final accuracy. As training progresses, ASTRO-RL performs self-reflective and corrective actions more frequently, and this increase tracks its improved performance.
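
One way to run this kind of analysis on generated traces is to count self-correction phrases per solution and correlate the per-checkpoint average with benchmark accuracy; the marker phrases and all numbers below are made-up placeholders, not the paper’s data.

    from statistics import correlation  # Python 3.10+

    # Assumed marker phrases for detecting self-correction in generated text.
    BACKTRACK_MARKERS = ("let's go back", "let me re-check", "that does not work")

    def count_backtracks(solution: str) -> int:
        """Count how many backtracking/self-correction phrases a solution contains."""
        s = solution.lower()
        return sum(s.count(marker) for marker in BACKTRACK_MARKERS)

    sample = "Try x = 3... that does not work. Let's go back to where we set up the equation."
    print("backtracks in sample:", count_backtracks(sample))

    # Illustrative per-checkpoint measurements (made up for this sketch):
    # average backtracks per solution vs. benchmark accuracy.
    avg_backtracks = [0.5, 1.0, 1.8, 2.4, 3.1]
    accuracy = [0.55, 0.62, 0.70, 0.76, 0.81]
    print(f"Pearson r = {correlation(avg_backtracks, accuracy):.2f}")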

Comparative Advantages and Interpretability

Control experiments show ASTRO outperforms models trained on direct chain-of-thought solutions without search priors by margins of +2% to +3.9% across benchmarks. ASTRO also produces outputs that can be visualized as directed graphs, where nodes represent reasoning steps and edges represent transitions and corrections, enhancing interpretability.
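
Below is a small sketch of that graph view for an already-segmented trace; the step indices, edge labels, and the use of networkx are illustrative choices rather than the authors’ tooling.

    import networkx as nx

    # Illustrative linearized trace: three forward steps plus one backtrack.
    steps = [
        "Set up the equation 2x + 3 = 11.",
        "Try x = 3: 2*3 + 3 = 9, which is not 11.",
        "Let's go back to where we set up the equation.",
        "Subtract 3 from both sides: 2x = 8, so x = 4.",
    ]

    G = nx.DiGraph()
    for i, text in enumerate(steps):
        G.add_node(i, text=text)

    # Edges between consecutive steps are ordinary transitions.
    for i in range(len(steps) - 1):
        G.add_edge(i, i + 1, kind="transition")

    # The backtracking step also points back to the step it returns to.
    G.add_edge(2, 0, kind="correction")

    print(nx.to_dict_of_lists(G))  # {0: [1], 1: [2], 2: [3, 0], 3: []}

Walking such a graph makes it easy to see where the model abandoned a branch and which earlier step it returned to.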

ASTRO illustrates that effective reasoning improvements in LLMs can come from principled post-training methods that mimic search algorithms in natural language rather than from bigger models or longer pretraining. This approach sets a new standard for fine-tuning open language models to achieve human-like reasoning through search-inspired behaviors.

For more details, see the original paper.
