Cerebras Cuts MiniMax-M2 to 162B with REAP, Preserving 10B Active Parameters for Long-Context Coding Agents

Cerebras released MiniMax-M2-REAP-162B-A10B, a REAP-pruned SMoE checkpoint that trims experts to reduce memory while keeping near-original performance on code, reasoning and agentic tool calls.

New compressed MiniMax-M2 for coding and agents

Cerebras published MiniMax-M2-REAP-162B-A10B, a REAP-compressed Sparse Mixture-of-Experts (SMoE) causal language model derived from MiniMax-M2. The checkpoint targets memory-constrained deployments such as coding agents and tool calling: experts are pruned to shrink the total parameter count while per-token active compute remains comparable to a 10B dense model.

Architecture and core specifications

Key properties of MiniMax-M2-REAP-162B-A10B:

  • Base model: MiniMax-M2
  • Compression method: REAP (Router-weighted Expert Activation Pruning)
  • Total parameters: 162B
  • Active parameters per token: 10B
  • Layers: 62 transformer blocks
  • Attention heads per layer: 48
  • Experts: 180, pruned from the original 256-expert configuration
  • Activated experts per token: 8
  • Context length: 196,608 tokens
  • License: modified MIT, derived from MiniMaxAI MiniMax M2

Because of the SMoE design, the model stores 162B parameters in total, but each token is routed to only 8 of the 180 experts, so effective compute per token is roughly that of a 10B dense model while the full parameter count preserves large-scale capacity.
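To illustrate why per-token compute stays low, the sketch below implements a generic top-k SMoE layer in PyTorch: only the k selected experts run for each token, so compute grows with k rather than with the total expert count. The class name and the hidden and FFN sizes are invented for the example and kept deliberately small; only the 8-of-180 routing mirrors the figures above, and nothing here is taken from the MiniMax-M2 implementation.

# Generic top-k SMoE layer (illustrative sizes, not MiniMax-M2's real config).
# Only the k routed experts run per token, so per-token compute scales with k.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, hidden=64, ffn=128, num_experts=180, k=8):
        super().__init__()
        self.k = k
        self.router = nn.Linear(hidden, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, ffn), nn.GELU(), nn.Linear(ffn, hidden))
            for _ in range(num_experts)
        )

    def forward(self, x):                                # x: [tokens, hidden]
        gates = F.softmax(self.router(x), dim=-1)        # [tokens, num_experts]
        top_g, top_i = gates.topk(self.k, dim=-1)        # keep 8 of 180 experts
        top_g = top_g / top_g.sum(-1, keepdim=True)      # renormalize kept gates
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in top_i[:, slot].unique().tolist():   # run each used expert once
                rows = top_i[:, slot] == e
                out[rows] += top_g[rows, slot, None] * self.experts[e](x[rows])
        return out

layer = TopKMoE()
print(layer(torch.randn(4, 64)).shape)                   # torch.Size([4, 64])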

How REAP compresses MiniMax-M2

MiniMax-M2-REAP-162B-A10B was produced by applying REAP uniformly across all MoE blocks with roughly 30% expert pruning. REAP ranks experts by a saliency score that combines two signals:

  • Router gate values: how often and how strongly the router selects an expert
  • Expert activation norms: the magnitude of the expert output when it is active

Experts that contribute minimally under this combined criterion are removed. Surviving experts keep their original weights and the router maintains separate gates for each remaining expert. REAP is applied as a one-shot compression step in this release, without additional fine-tuning after pruning.
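To make the ranking concrete, here is a minimal sketch of a REAP-style saliency score computed over a calibration batch: each expert is scored by the average of its gate value times its output norm on the tokens routed to it, and the lowest-scoring 30% are dropped. The function names, tensor layout and the dense expert_outputs tensor are invented for illustration and do not reproduce the actual REAP or Cerebras implementation.

# REAP-style saliency sketch: score experts by mean gate value times output
# norm on tokens routed to them, then keep the top 70%. Shapes are toy-sized.
import torch

def reap_saliency(gates, expert_outputs, top_idx):
    # gates:          [tokens, num_experts]          router probabilities
    # expert_outputs: [tokens, num_experts, hidden]  f_j(x), dense here for clarity
    # top_idx:        [tokens, k]                    experts each token was routed to
    selected = torch.zeros_like(gates)
    selected.scatter_(1, top_idx, 1.0)                         # 1 where expert was active
    contrib = gates * expert_outputs.norm(dim=-1) * selected   # g_j(x) * ||f_j(x)||
    counts = selected.sum(0).clamp(min=1)
    return contrib.sum(0) / counts                             # mean contribution per expert

def prune_experts(saliency, prune_ratio=0.3):
    num_prune = int(prune_ratio * saliency.numel())
    keep = saliency.argsort(descending=True)[: saliency.numel() - num_prune]
    return keep.sort().values                                  # surviving expert indices

# Toy example: 16 experts, top-2 routing, 30% pruning leaves 12 survivors.
g = torch.softmax(torch.randn(64, 16), dim=-1)
f = torch.randn(64, 16, 32)
idx = g.topk(2, dim=-1).indices
print(prune_experts(reap_saliency(g, f, idx)))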

A theoretical result reported by the REAP authors notes that merging experts and summing their gates can cause functional subspace collapse: when multiple experts are merged, the router loses input-dependent control and a single merged expert must approximate a formerly input-dependent mixture, which introduces irreducible error when experts differ. Pruning, by contrast, removes experts but preserves independent control of survivors, so error scales with the gate weight of removed experts rather than causing collapse.
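In schematic terms (the notation below is illustrative, not lifted from the REAP paper), an SMoE layer computes, over the active expert set A(x),

    y(x) = \sum_{j \in A(x)} g_j(x) \, f_j(x).

Merging a group M of experts into one expert f_M with summed gate g_M(x) = \sum_{j \in M} g_j(x) asks a single fixed expert to stand in for an input-dependent mixture, so the residual \lVert \sum_{j \in M} g_j(x) f_j(x) - g_M(x) f_M(x) \rVert cannot vanish for all inputs unless the merged experts coincide. Pruning a set P leaves the survivors untouched, and the error \lVert \sum_{j \in P} g_j(x) f_j(x) \rVert \le \sum_{j \in P} g_j(x) \lVert f_j(x) \rVert is controlled by the gate-weighted activations of the removed experts.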

Across SMoE models from roughly 20B to 1T parameters, REAP reportedly outperforms expert merging and alternative pruning criteria on generative tasks such as code generation, mathematical reasoning and tool calling, particularly at higher compression rates like 50%.

Accuracy at 30% expert pruning

Cerebras compares three checkpoints on standard coding, reasoning and agentic benchmarks:

  • MiniMax-M2 (230B, base model)
  • MiniMax-M2-REAP-172B-A10B (25% pruning)
  • MiniMax-M2-REAP-162B-A10B (30% pruning)

On coding benchmarks such as HumanEval, HumanEval Plus, MBPP and MBPP Plus, the 162B REAP model stays very close to the base MiniMax-M2: HumanEval results sit in the 90% range and MBPP in the 80% range, with the 172B and 162B REAP variants tracking the original within a few points.

On reasoning tasks such as AIME 25 and MATH 500, the models show only small shifts and no collapse at 30% pruning. On tool-calling and agentic evaluations, represented by τ²-Bench in a telecom setting, the 162B REAP model matches the base model within small variance. The model card states that the checkpoint maintains almost identical behavior while being about 30% lighter in parameter count.

These findings align with the broader REAP study, which reports near-lossless compression for code generation and tool calling on several large SMoE architectures when applying the REAP pruning criterion.

Deployment, memory usage and throughput

Cerebras supplies a direct vLLM serve example and positions MiniMax-M2-REAP-162B-A10B as a drop-in replacement for existing MiniMax-M2 integrations. If runs hit memory limits, the model card recommends lowering --max-num-seqs (for example, to 64) to keep the batch size in check on a given GPU.

vllm serve cerebras/MiniMax-M2-REAP-162B-A10B \
    --tensor-parallel-size 8 \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think \
    --trust-remote-code \
    --enable-expert-parallel \
    --enable-auto-tool-choice
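Once the server is running, the checkpoint can be queried through vLLM's OpenAI-compatible endpoint. The snippet below is a minimal sketch assuming the default localhost:8000 address; the run_tests tool schema and the prompt are placeholders for illustration, not part of the model card.

# Query the vLLM OpenAI-compatible server started above. The endpoint,
# tool schema and prompt are illustrative placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",                     # hypothetical agent tool
        "description": "Run the project's test suite and return the output.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

resp = client.chat.completions.create(
    model="cerebras/MiniMax-M2-REAP-162B-A10B",
    messages=[{"role": "user", "content": "Fix the failing test in src/utils.py."}],
    tools=tools,
    tool_choice="auto",
)
print(resp.choices[0].message.tool_calls or resp.choices[0].message.content)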

Key takeaways

  • SMoE with efficient per-token compute: MiniMax-M2-REAP-162B-A10B stores 162B parameters but routes tokens to a small expert subset so compute per token is close to a 10B dense model.
  • REAP pruning preserves MiniMax-M2 behavior: Router-weighted Expert Activation Pruning removes about 30% of experts based on router gate values and activation norms while keeping survivors and routing structure intact.
  • Near-lossless at 30% compression: On code, reasoning and tool-use benchmarks the 162B REAP variant tracks the 230B base model and a 172B REAP variant within a few points.
  • Pruning beats merging for generative SMoE: REAP avoids functional subspace collapse associated with merging and performs better across large SMoE models in generative tasks.

For model downloads and further details, see the Cerebras model card on Hugging Face and the accompanying resources on the project pages.
