Tiny Recursive Model: How a 7M-Parameter Solver Outperforms Much Larger LLMs on ARC-AGI
What TRM is and why it matters
Samsung SAIT Montreal introduced the Tiny Recursive Model (TRM), a compact recursive reasoner with roughly 7 million parameters that challenges larger autoregressive LLMs on symbolic reasoning benchmarks. TRM is an iterative draft–revise solver that maintains a latent scratchpad and a current solution embedding, repeatedly refining its candidate answers through recursion rather than autoregressive token decoding.
Core architectural changes
TRM departs from the prior Hierarchical Reasoning Model (HRM) by simplifying the design into a single tiny recurrent core. Instead of HRM’s two-module hierarchy and fixed-point gradient approximation, TRM uses a 2-layer network that jointly updates a latent scratchpad z and a current solution embedding y. The model alternates between a “think” update and an “act” update:
- think: update z ← f(x, y, z) for n inner steps
- act: update y ← g(y, z)
This think→act block is unrolled up to 16 times, with deep supervision and a learned halting head applied during training; at test time the full unroll is used. Signals propagate across steps through the paired state (y, z).
TRM also moves away from HRM’s one-step implicit fixed-point gradient approximation by backpropagating through the entire recursion during training, which the authors report as essential for generalization.
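The recursion described above can be sketched in a few lines. This is a toy illustration, not the released implementation: the real `f` and `g` are learned 2-layer networks, whereas here they are placeholder arithmetic stand-ins, and the deep-supervision/halting machinery is omitted.

```python
def think(x, y, z):
    # Latent scratchpad update z <- f(x, y, z).
    # Placeholder arithmetic; the real f is a learned 2-layer network.
    return [0.5 * zi + 0.25 * xi + 0.25 * yi for xi, yi, zi in zip(x, y, z)]

def act(y, z):
    # Solution refinement y <- g(y, z). Also a toy stand-in.
    return [0.5 * yi + 0.5 * zi for yi, zi in zip(y, z)]

def trm_forward(x, y, z, T=3, n=6):
    """Unroll T think->act blocks; each block runs n inner think steps.
    Training backpropagates through this entire recursion (no one-step
    fixed-point gradient approximation as in HRM)."""
    for _ in range(T):
        for _ in range(n):
            z = think(x, y, z)   # "think": refine the latent scratchpad
        y = act(y, z)            # "act": refine the candidate solution
    return y, z

x = [1.0, 0.0]                   # toy input embedding
y, z = trm_forward(x, [0.0, 0.0], [0.0, 0.0])
```

The key structural point is that a single tiny core is applied repeatedly, with (y, z) carrying information across blocks, rather than depth coming from stacked distinct layers.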
Architectures and training details
- A single tiny recurrent core of two layers replaces HRM’s two modules. Effective depth is achieved via recursion and unrolling rather than stacking many layers.
- For ARC and large maze grids, the best-performing TRM variant retains self-attention. For small fixed grids like Sudoku, the team replaces self-attention with an MLP-Mixer-style token mixer to reduce overcapacity.
- A small exponential moving average (EMA) over weights stabilizes training on limited data.
- Effective depth comes from recursion rather than stacking, under typical unroll settings such as T = 3 outer blocks and n = 6 inner think steps. In ablation studies, the two-layer core generalizes better than deeper variants at similar compute.
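Plugging the typical settings into the emulated-depth estimate ≈ T·(n+1)·layers makes the trade concrete: a 2-layer network behaves like a much deeper stack.

```python
# Effective depth from recursion rather than stacking: T outer
# think->act blocks, n inner think steps per block, and a 2-layer core.
T, n, layers = 3, 6, 2
effective_depth = T * (n + 1) * layers
print(effective_depth)  # 42 layer applications from only 2 physical layers
```

So the parameter count stays at ~7M while the computation graph at train and test time resembles a network dozens of layers deep.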
Benchmark performance
TRM shows surprising performance gains on several benchmarks compared to larger models and prior specialized architectures:
- ARC-AGI-1 / ARC-AGI-2 (two tries): TRM-Attn (7M) achieves 44.6% / 7.8% vs HRM (27M) at 40.3% / 5.0%.
- Reported LLM baselines under the paper’s evaluation: DeepSeek-R1 (671B) 15.8% / 1.3%, o3-mini-high 34.5% / 3.0%, Gemini 2.5 Pro 37.0% / 4.9%.
- Sudoku-Extreme (9×9, 1K train / 423K test): 87.4% with an attention-free mixer vs HRM 55.0%.
- Maze-Hard (30×30): 85.3% vs HRM 74.5%.
These are trained-from-scratch direct-prediction models on small but heavily augmented datasets, not few-shot prompting. ARC remains the canonical target, with broader leaderboard context and thresholds tracked by the ARC Prize Foundation.
Why a 7M model can beat larger LLMs on these tasks
- Draft-then-revise workflow: TRM drafts a full candidate solution and then refines it via iterative latent consistency checks against the input, avoiding the exposure bias of autoregressive token-by-token decoding when producing structured outputs.
- Compute allocated to test-time reasoning: Effective depth comes from recursion and unrolling (approximate emulated depth ≈ T·(n+1)·layers). The team shows that depth via recursion yields better generalization at constant compute than simply adding parameters or layers.
- Task-specific inductive biases: For small, fixed-grid tasks like Sudoku, attention-free mixers reduce model capacity and improve the bias/variance trade-off. Self-attention remains useful for larger spatial grids like 30×30 mazes.
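The attention-free alternative for small fixed grids can be illustrated with a minimal token-mixing step in the MLP-Mixer style. This is a hypothetical sketch with tiny fixed matrices; in the actual model the mixing weights are learned, and sizes match the 81-cell Sudoku grid.

```python
def transpose(m):
    # Swap rows and columns of a list-of-lists matrix.
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    # Plain list-of-lists matrix product.
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

# 3 tokens of dim 2 (stand-in for the 81 grid-cell tokens of Sudoku).
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Fixed token-mixing matrix (learned in the real model); this one simply
# swaps the first two tokens and keeps the third.
W_tok = [[0.0, 1.0, 0.0],
         [1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0]]

# MLP-Mixer-style token mixing: multiply along the token axis so every
# token sees every other token, with no attention scores to compute.
mixed = matmul(W_tok, tokens)  # (3x3) @ (3 tokens x 2 dims) -> (3x2)
```

Because the grid size is fixed, a learned mixing matrix of fixed shape suffices, which removes the capacity and overhead of self-attention on these small inputs.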
Key takeaways
- Architecture: TRM is a ~7M-parameter, 2-layer recursive solver that alternates latent think updates z ← f(x, y, z) and act refinements y ← g(y, z), unrolled up to 16 steps with deep supervision and full backprop through recursion.
- Results: TRM reports ~44.6–45% on ARC-AGI-1 and ~7.8–8% on ARC-AGI-2 (two-try), outperforming several much larger LLM baselines on the stated public evaluations.
- Implication: Allocating compute to recursive refinement and tighter task-specific inductive biases can outperform parameter scaling on symbolic and geometric reasoning benchmarks. The team released code on GitHub, providing a compact recipe for from-scratch training on these tasks.
For more technical detail, see the paper on arXiv: https://arxiv.org/pdf/2510.04871v1