Why Apple's Criticism of AI Reasoning Falls Short: A Closer Look at the Debate
Apple's critique of AI reasoning abilities is challenged by Anthropic, who argue that evaluation flaws, not model limitations, explain perceived failures in AI reasoning tasks.
Conflicting Views on AI Reasoning Capabilities
The reasoning abilities of Large Reasoning Models (LRMs) have recently become the subject of debate, fueled by two contrasting papers. Apple's "Illusion of Thinking" argues that LRMs have fundamental reasoning limits, while Anthropic's response, "The Illusion of the Illusion of Thinking," challenges these claims, attributing the perceived failures to flaws in evaluation methods rather than to the models themselves.
Apple's Findings on Reasoning Limits
Apple systematically tested LRMs in controlled puzzle environments such as Tower of Hanoi and River Crossing. They observed an "accuracy collapse" as puzzle complexity increased: models such as Claude 3.7 Sonnet and DeepSeek-R1 failed on the harder instances while also spending fewer reasoning tokens on them, which Apple interpreted as diminished reasoning effort. Apple identified three complexity regimes: low complexity, where standard LLMs outperform LRMs; medium complexity, where LRMs excel; and high complexity, where both collapse. The study concluded that LRMs struggle with exact computation and consistent algorithmic reasoning across puzzles.
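Neither paper's code is reproduced here, but a minimal Python sketch of the Tower of Hanoi task shows why complexity ramps up so quickly: the optimal solution for n disks takes 2^n - 1 moves, so the answer a model is asked to write out roughly doubles with every added disk.

```python
# Minimal sketch (not Apple's code) of the Tower of Hanoi task used in the study.
# The optimal solution has 2**n - 1 moves, so the move list a model must emit
# grows exponentially with the number of disks n.

def hanoi_moves(n: int, src: str = "A", aux: str = "B", dst: str = "C") -> list[tuple[str, str]]:
    """Return the optimal move sequence as (from_peg, to_peg) pairs for n disks."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, src, dst, aux)    # clear the n-1 smaller disks onto the spare peg
        + [(src, dst)]                       # move the largest disk to the target peg
        + hanoi_moves(n - 1, aux, src, dst)  # restack the n-1 smaller disks on top of it
    )

for n in (5, 10, 15):
    print(f"{n} disks -> {len(hanoi_moves(n)):,} moves")  # 31, 1,023, 32,767
```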
Anthropic's Critique of Apple's Methodology
Anthropic points out critical flaws in Apple's experimental design:
- Token Limitations vs. Logical Failures: The failures observed in the Tower of Hanoi experiments stemmed primarily from token output constraints, with models deliberately truncating their output as they approached the limit, rather than from reasoning deficiencies (see the token-budget sketch below).
- Misclassification of Reasoning Breakdown: Apple's automated evaluation mistakenly treated these intentional output truncations as reasoning failures, unfairly penalizing the models.
- Unsolvable Problems Misinterpreted: Some of the River Crossing instances Apple tested were mathematically unsolvable, such as those with six or more actor/agent pairs but a boat capacity of only three. Scoring these as failures distorted the evaluation.
Anthropic also showed that when models are asked to provide concise solutions instead (e.g., a Lua function that generates the move sequence), their accuracy on complex puzzles improves significantly, indicating that evaluation design, not reasoning ability, was the issue.
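The rebuttal asked for Lua; the rough Python sketch below is only an illustrative analogue of the arithmetic behind the argument, and the characters-per-token ratio and output budget are assumed values. The point it makes: a program-shaped answer stays a few dozen tokens no matter how many disks are involved, whereas the enumerated move list quickly exceeds any realistic output limit.

```python
# Rough illustration (not from either paper) of why enumerating every move
# overruns token limits while a compact, program-shaped answer does not.
# CHARS_PER_TOKEN and OUTPUT_BUDGET_TOKENS are assumed values for illustration.

CHARS_PER_TOKEN = 4            # rough average for short ASCII move strings
OUTPUT_BUDGET_TOKENS = 32_000  # hypothetical model output limit

def enumerated_tokens(n_disks: int) -> int:
    """Approximate tokens needed to write out every move (about 5 characters each)."""
    moves = 2 ** n_disks - 1
    return moves * 5 // CHARS_PER_TOKEN

# A compact answer: source code that generates the full solution on demand.
COMPACT_ANSWER = '''
def hanoi(n, src="A", aux="B", dst="C"):
    if n == 0:
        return []
    return hanoi(n - 1, src, dst, aux) + [(src, dst)] + hanoi(n - 1, aux, src, dst)
'''
compact_tokens = len(COMPACT_ANSWER) // CHARS_PER_TOKEN

for n in (10, 15, 20):
    needed = enumerated_tokens(n)
    verdict = "over budget" if needed > OUTPUT_BUDGET_TOKENS else "within budget"
    print(f"{n} disks: enumerated ~{needed:,} tokens ({verdict}), compact ~{compact_tokens} tokens")
```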
Rethinking Complexity Metrics
Anthropic critiques Apple's use of compositional depth (number of moves) as a complexity metric, noting it mixes mechanical execution with cognitive difficulty. For instance, Tower of Hanoi requires many moves but simple decisions, whereas River Crossing involves fewer steps but greater cognitive challenges due to constraints and search processes.
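As a concrete comparison (illustrative, not from either paper): the k-th move of an optimal Tower of Hanoi solution follows from a closed-form bit rule with no lookahead, while even the classic missionaries-and-cannibals river crossing, a simpler cousin of Apple's actor/agent puzzle, can only be solved by searching a space of constrained states.

```python
from collections import deque

# --- Tower of Hanoi: enormous move counts, trivial per-move decisions. ---
# Known bit trick: move k of the optimal solution is computable in O(1)
# (which peg ends up as the target depends only on the parity of the disk count).
def hanoi_move(k: int) -> tuple[int, int]:
    return (k & (k - 1)) % 3, ((k | (k - 1)) + 1) % 3

# --- River Crossing (classic missionaries-and-cannibals variant): far fewer
# moves, but each one must be found by search under safety constraints. ---
def solve_river(m: int = 3, c: int = 3, boat_cap: int = 2):
    """Breadth-first search over (missionaries left, cannibals left, boat side)."""
    def safe(ml: int, cl: int) -> bool:
        mr, cr = m - ml, c - cl
        return (ml == 0 or ml >= cl) and (mr == 0 or mr >= cr)

    start, goal = (m, c, 1), (0, 0, 0)        # boat side: 1 = start bank, 0 = far bank
    frontier, seen = deque([(start, [])]), {start}
    while frontier:
        (ml, cl, b), path = frontier.popleft()
        if (ml, cl, b) == goal:
            return path
        for dm in range(boat_cap + 1):
            for dc in range(boat_cap + 1 - dm):
                if dm + dc == 0:
                    continue                   # the boat cannot cross empty
                avail_m = ml if b == 1 else m - ml
                avail_c = cl if b == 1 else c - cl
                if dm > avail_m or dc > avail_c:
                    continue                   # not enough people on this bank
                nml = ml - dm if b == 1 else ml + dm
                ncl = cl - dc if b == 1 else cl + dc
                state = (nml, ncl, 1 - b)
                if safe(nml, ncl) and state not in seen:
                    seen.add(state)
                    frontier.append((state, path + [(dm, dc)]))
    return None                                # no solution for this instance

print([hanoi_move(k) for k in range(1, 8)])    # 7 moves for 3 disks, each from a formula
print(len(solve_river()))                      # 11 crossings, each found only by search
```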
Implications for AI Evaluation
The debate highlights significant gaps in current AI evaluation:
- Differentiating reasoning ability from practical constraints like token limits.
- Ensuring that tested problems are solvable.
- Refining complexity metrics to reflect genuine cognitive challenges.
- Exploring diverse solution formats to better assess reasoning capabilities.
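A hypothetical pre-scoring guard (a sketch under assumptions, not a harness from either paper) shows how the first two points translate into code: exclude instances that have no solution, and record a truncated attempt as inconclusive rather than incorrect.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Attempt:
    output: str
    hit_token_limit: bool          # e.g., taken from the inference API's stop reason

def river_crossing_solvable(n_pairs: int, boat_capacity: int) -> bool:
    # Covers only the case discussed above: with boat capacity 3 the puzzle has
    # no solution once there are 6 or more actor/agent pairs.  A real harness
    # would run a solver rather than a lookup.
    return not (boat_capacity == 3 and n_pairs >= 6)

def score(attempt: Attempt, n_pairs: int, boat_capacity: int,
          is_correct: Callable[[str], bool]) -> str:
    if not river_crossing_solvable(n_pairs, boat_capacity):
        return "excluded: unsolvable instance"
    if attempt.hit_token_limit:
        return "inconclusive: output truncated by token budget"
    return "correct" if is_correct(attempt.output) else "incorrect"
```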
Conclusion
Apple's assertion that LRMs inherently lack robust reasoning appears premature. Anthropic's rebuttal suggests LRMs possess sophisticated reasoning when evaluated properly, underscoring the need for nuanced and careful assessment methods to truly understand AI models' strengths and limitations.
For more details, see the Apple and Anthropic papers.