Adaptive Reasoning Model (ARM) and Ada-GRPO: Revolutionizing Efficient AI Problem-Solving
The Adaptive Reasoning Model (ARM) and Ada-GRPO introduce a dynamic approach to AI reasoning, significantly improving efficiency while preserving accuracy by tailoring the reasoning strategy to task complexity.
Challenges in AI Reasoning
Reasoning tasks in artificial intelligence cover commonsense understanding, mathematical problem-solving, and symbolic reasoning. These tasks require multiple logical inference steps, which large language models (LLMs) attempt to replicate using structured methods like chain-of-thought (CoT) prompting. However, as LLMs grow larger, they tend to generate longer responses regardless of task complexity, causing inefficiency and sometimes reduced accuracy due to over-explaining.
Limitations of Current Models
Most reasoning models, including popular ones like OpenAI’s o1 and DeepSeek-R1, apply a uniform long CoT strategy for all tasks. This leads to the “overthinking” problem, where simpler tasks receive unnecessarily verbose outputs, wasting computational resources and potentially harming accuracy. Existing solutions like prompt-guided generation or token budget estimation rely on fixed assumptions that do not generalize well across diverse tasks.
Previous Approaches and Their Drawbacks
Methods such as GRPO (Group Relative Policy Optimization), length-penalty mechanisms, and rule-based prompt controls have been explored to address these inefficiencies. While GRPO rewards correct answers and encourages learning different reasoning strategies, it suffers from “format collapse,” where models overuse Long CoT at the expense of more efficient formats like Short CoT or Direct Answer. Length-penalty techniques can reduce output length but often harm accuracy on complex problems. These approaches struggle to balance reasoning effectiveness and efficiency consistently.
Introduction of Adaptive Reasoning Model (ARM) and Ada-GRPO
Researchers from Fudan University and Ohio State University propose the Adaptive Reasoning Model (ARM), which dynamically adjusts reasoning formats based on task difficulty. ARM supports four reasoning formats:
- Direct Answer for simple queries
- Short Chain-of-Thought (Short CoT) for concise reasoning
- Code for structured problem-solving
- Long Chain-of-Thought (Long CoT) for complex multi-step inference
ARM operates primarily in Adaptive Mode, selecting a suitable format automatically. It also offers an Instruction-Guided Mode for explicitly specifying a format and a Consensus-Guided Mode that aggregates answers across formats.
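To make the three modes concrete, here is a minimal sketch of how inference could look, assuming the model is trained to emit a format tag before its reasoning. The tag names, the `model.generate()` call, the `extract_answer()` helper, and the consensus fallback rule are illustrative assumptions, not the paper's exact interface.

```python
# Hypothetical sketch of ARM's three inference modes; the format tags and the
# model.generate()/extract_answer() helpers are illustrative assumptions.

FORMATS = ("<direct_answer>", "<short_cot>", "<code>", "<long_cot>")

def extract_answer(output: str) -> str:
    # Placeholder: pull the final answer line out of the raw model output.
    return output.strip().splitlines()[-1]

def adaptive_mode(model, question: str) -> str:
    # Adaptive Mode: the model itself chooses a format by emitting one of the
    # format tags before its reasoning, so no extra control is needed.
    return model.generate(question)

def instruction_guided_mode(model, question: str, fmt: str) -> str:
    # Instruction-Guided Mode: force a specific format by appending its tag.
    assert fmt in FORMATS
    return model.generate(f"{question}\n{fmt}")

def consensus_guided_mode(model, question: str) -> str:
    # Consensus-Guided Mode (assumed aggregation rule): run the three cheaper
    # formats and accept their answer only if they agree; otherwise escalate
    # to Long CoT.
    answers = {extract_answer(instruction_guided_mode(model, question, f))
               for f in ("<direct_answer>", "<short_cot>", "<code>")}
    if len(answers) == 1:
        return answers.pop()
    return extract_answer(instruction_guided_mode(model, question, "<long_cot>"))
```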
A key innovation is Ada-GRPO, an extension of GRPO that incorporates a format diversity reward. This mechanism prevents the dominance of Long CoT and encourages exploration of simpler, efficient reasoning formats.
Training Framework of ARM
ARM training involves two stages:
- Supervised Fine-Tuning (SFT): The model is first fine-tuned on 10.8K questions from datasets such as AQuA-Rat, each annotated in all four reasoning formats with solutions generated by GPT-4o and DeepSeek-R1. This stage teaches the structure of each format but not when to use it.
- Ada-GRPO Training: The model then receives scaled rewards for using less frequent formats (e.g., Direct Answer or Short CoT), as sketched below. A decaying factor shifts the reward focus back to accuracy over time, avoiding long-term bias toward any single format and enabling dynamic strategy selection.
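The sketch below illustrates the diversity-scaled reward described in the Ada-GRPO stage, assuming one group of sampled responses per question as in GRPO. The cosine decay schedule and the exact scaling formula are illustrative assumptions, not the paper's precise formulation.

```python
import math
from collections import Counter

def ada_grpo_rewards(responses, gold_answer, step, total_steps):
    """Illustrative Ada-GRPO-style reward shaping for one group of sampled
    responses to a single question. `responses` is a list of (format, answer)
    pairs. Rarely used formats get their accuracy reward scaled up, and the
    scaling decays toward plain accuracy as training progresses."""
    G = len(responses)
    format_counts = Counter(fmt for fmt, _ in responses)

    # Assumption: cosine decay from 1.0 (full diversity bonus) to 0.0.
    decay = 0.5 * (1 + math.cos(math.pi * step / total_steps))

    rewards = []
    for fmt, answer in responses:
        accuracy = 1.0 if answer == gold_answer else 0.0
        # Rarity bonus: formats sampled less often in this group are boosted,
        # which keeps Direct Answer / Short CoT from collapsing early on.
        rarity = G / format_counts[fmt]
        scale = decay * rarity + (1 - decay)
        rewards.append(scale * accuracy)

    # GRPO-style group normalization: advantage = (reward - mean) / std.
    mean = sum(rewards) / G
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / G) or 1.0
    return [(r - mean) / std for r in rewards]
```

Because the rarity bonus multiplies the accuracy reward rather than replacing it, an incorrect answer in a rare format still earns nothing; the bonus only redistributes credit among correct answers, which is what counteracts format collapse without rewarding wrong outputs.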
Performance and Efficiency Gains
ARM achieves significant efficiency improvements without compromising accuracy. It reduces token usage by approximately 30% on average and up to 70% on simpler tasks compared to models relying solely on Long CoT. Training speed is doubled compared to GRPO-based models.
Examples of ARM's performance include:
- ARM-7B: 75.9% accuracy on the AIME'25 task while using 32.5% fewer tokens
- ARM-14B: 85.6% accuracy on OpenBookQA and 86.4% on the MATH dataset, with over 30% token reduction compared to Qwen2.5SFT+GRPO models
Conclusion
ARM combined with Ada-GRPO offers a flexible, scalable solution to the inefficiencies of current reasoning models by adapting reasoning strategies to task complexity. This approach balances accuracy and computational resources effectively, paving the way for more efficient large language models.
For more details, see the original paper, the models on Hugging Face, and the project page.