Meta’s Metacognitive Reuse: Turning Repeated Thought into a Behavior Handbook to Cut Tokens by 46%
Meta researchers propose a technique that compresses recurring chain-of-thought reasoning into short, named procedures called behaviors, then reuses or distills them to make large-language-model reasoning far more token-efficient.
Why this matters
Long chain-of-thought (CoT) traces often re-derive the same subroutines—inclusion–exclusion, base conversions, common geometric steps—over and over. That redundancy increases output length, raises latency, and uses up budget that could be spent exploring novel reasoning paths. Meta frames the fix as procedural memory for LLMs: a compact, searchable handbook of how-to steps that the model can consult or internalize.
How the pipeline works
The system uses three roles to build and apply a behavior handbook (a code sketch of the strategist loop follows the list):
- Metacognitive Strategist (R1-Llama-70B): solves problems to produce traces, reflects on those traces to identify reusable steps, and emits behaviors as short name→instruction pairs that populate the handbook.
- Teacher (LLM B): generates behavior-conditioned responses that form training targets.
- Student (LLM C): either consumes behaviors in-context at inference or is fine-tuned on behavior-conditioned outputs so the behaviors become parametric.
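A minimal sketch of the strategist stage, under stated assumptions: `call_llm` is a hypothetical wrapper around any chat-completion API, and the prompt wording paraphrases rather than reproduces the paper's templates:

```python
# Minimal sketch of the Metacognitive Strategist loop: solve -> reflect ->
# extract name->instruction pairs. `call_llm` is a hypothetical helper
# wrapping any chat-completion API; prompts are paraphrased, not verbatim.
from typing import Callable

LLMCall = Callable[[str, str], str]  # (model_name, prompt) -> completion


def build_handbook(problems: list[str], strategist: str, call_llm: LLMCall) -> dict[str, str]:
    """Populate the behavior handbook from the strategist's own traces."""
    handbook: dict[str, str] = {}
    for problem in problems:
        trace = call_llm(strategist, f"Solve step by step:\n{problem}")
        reflection = call_llm(
            strategist, f"Reflect on this solution and flag reusable steps:\n{trace}"
        )
        extracted = call_llm(
            strategist,
            "List reusable procedures, one per line, in the form "
            f"'behavior_name -> instruction':\n{reflection}",
        )
        # Parse the ASCII arrow form of the name->instruction pairs.
        for line in extracted.splitlines():
            if "->" in line:
                name, instruction = line.split("->", 1)
                handbook[name.strip()] = instruction.strip()
    return handbook
```

The teacher and student stages then consume this handbook, either in-context (BCI) or as fine-tuning targets (BC-SFT), as described next.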
Retrieval is topic-based for MATH and embedding-based (BGE-M3 + FAISS) for AIME. Prompts include explicit solution, reflection, behavior extraction, and behavior-conditioned inference (BCI) templates. In BCI, the model is instructed to reference behaviors explicitly, which yields short, structured derivations.
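For the embedding path, a plausible setup looks like the sketch below. It assumes the `sentence-transformers` and `faiss` packages; embedding the name and instruction together is an illustrative choice, not a detail from the paper.

```python
# Embedding-based behavior retrieval in the spirit of the paper's
# BGE-M3 + FAISS setup. Index layout details here are assumptions.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("BAAI/bge-m3")


def build_index(handbook: dict[str, str]):
    names = list(handbook)
    # Embed "name: instruction" so both fields contribute to the match.
    vecs = encoder.encode(
        [f"{n}: {handbook[n]}" for n in names], normalize_embeddings=True
    )
    # Inner product on unit-normalized vectors == cosine similarity.
    index = faiss.IndexFlatIP(vecs.shape[1])
    index.add(np.asarray(vecs, dtype=np.float32))
    return index, names


def retrieve_behaviors(problem: str, index, names, handbook, k: int = 5) -> dict[str, str]:
    """Return the K behaviors most similar to the problem statement."""
    q = encoder.encode([problem], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q, dtype=np.float32), k)
    return {names[i]: handbook[names[i]] for i in ids[0]}
```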
Modes of evaluation and use
- Behavior-Conditioned Inference (BCI): retrieve K relevant behaviors and prepend them to the prompt so the student cites and uses them (a prompt-assembly sketch follows this list).
- Behavior-Guided Self-Improvement: extract behaviors from a model’s own earlier attempts and feed them back to improve revisions.
- Behavior-Conditioned SFT (BC-SFT): fine-tune students on teacher outputs that follow behavior-guided reasoning, enabling behavior usage without retrieval at test time.
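Prompt assembly for BCI can be as simple as prepending the retrieved pairs. A hedged sketch, with instruction wording paraphrased rather than taken from the paper's template:

```python
def bci_prompt(problem: str, behaviors: dict[str, str]) -> str:
    """Prepend retrieved behaviors and ask the student to cite them by name."""
    listing = "\n".join(f"- {name}: {instr}" for name, instr in behaviors.items())
    return (
        "You may use the following behaviors, citing each by name when applied:\n"
        f"{listing}\n\n"
        f"Problem: {problem}\n"
        "Solve concisely, referencing behaviors explicitly."
    )

# BC-SFT reuses the same idea offline: teacher responses to bci_prompt(...)
# become (problem, response) fine-tuning pairs, so the student no longer
# needs the behavior listing in its prompt at test time.
```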
Key results on MATH and AIME
- Token efficiency: On MATH-500, BCI reduces reasoning tokens by up to 46% versus the same model without behaviors while matching or improving accuracy. This holds across R1-Llama-70B and Qwen3-32B and across token budgets (2,048–16,384).
- Self-improvement gains: On AIME-24, behavior-guided self-improvement outperforms a critique-and-revise baseline at most budgets, reaching up to 10% higher accuracy as budgets grow.
- BC-SFT quality lift: Fine-tuned students across several model families (Llama-3.1-8B-Instruct, Qwen2.5-14B, Qwen2.5-32B, Qwen3-14B) consistently beat standard SFT and base models in accuracy while remaining more token-efficient.
Importantly, the improved generalization is not attributable to an easier training corpus: teacher correctness rates in original and behavior-conditioned training sets are similar, yet BC-SFT students generalize better.
What a behavior looks like
Behaviors are compact name→instruction pairs, ranging from general strategies to concrete mathematical tools. For example:
- behavior_inclusion_exclusion_principle: avoid double counting by subtracting intersections
- behavior_translate_verbal_to_equation: formalize word problems systematically
- behavior_distance_from_point_to_line: apply |Ax₀+By₀+C|/√(A²+B²) for the distance from a point (x₀, y₀) to the line Ax+By+C=0, e.g. for tangency checks
During BCI, students explicitly cite behaviors as they use them, making traces auditable and compact.
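The third behavior above is just the point-to-line distance formula. A quick sanity check of the tangency test it encodes (a circle is tangent to a line exactly when its center's distance to the line equals the radius):

```python
import math


def point_line_distance(x0: float, y0: float, a: float, b: float, c: float) -> float:
    """Distance from (x0, y0) to the line a*x + b*y + c = 0."""
    return abs(a * x0 + b * y0 + c) / math.hypot(a, b)


# The circle centered at (3, 4) with radius 5 is tangent to the line y = -1
# (written as 0*x + 1*y + 1 = 0): the center sits exactly 5 units away.
assert math.isclose(point_line_distance(3, 4, 0, 1, 1), 5.0)
```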
Retrieval, cost and latency considerations
BCI adds input tokens (the behaviors), but those tokens can be retrieved and cached ahead of time and are processed in a single parallel prefill pass rather than generated autoregressively. On most commercial APIs input tokens are billed at a lower rate than output tokens, so shrinking output length lowers both cost and latency. BC-SFT removes retrieval entirely at test time by baking behavior usage into the model's weights.
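A back-of-the-envelope illustration with invented prices (the input/output price ratio, not the absolute numbers, carries the argument):

```python
# Illustrative cost arithmetic with made-up rates: input at $1 per million
# tokens, output at $4 per million. Exact prices vary by provider.
P_IN, P_OUT = 1e-6, 4e-6


def cost(in_tokens: int, out_tokens: int) -> float:
    return in_tokens * P_IN + out_tokens * P_OUT


baseline = cost(in_tokens=500, out_tokens=8000)
# BCI: ~1,500 extra input tokens of behaviors, but 46% fewer output tokens.
bci = cost(in_tokens=2000, out_tokens=int(8000 * 0.54))
print(f"baseline ${baseline:.4f} vs BCI ${bci:.4f}")  # ~$0.0325 vs ~$0.0193
```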
Why this approach works and open questions
Storing procedural instructions complements retrieval-augmented generation’s declarative memory: behaviors capture how to reason, not just what to recall. By replacing verbose derivations with concise reusable steps, the model avoids re-derivation and can reallocate compute to novel subproblems. Behavior prompts bias the decoder toward efficient, correct trajectories, and BC-SFT internalizes those trajectories so models can invoke them implicitly.
Open engineering questions include scaling the handbook beyond math, organizing a growing behavior corpus, and maintaining quality and relevance as behaviors accumulate.
For more details see the paper: https://arxiv.org/pdf/2509.13237