
Ant Group Unveils Ling 2.0 — Scaling Sparse MoE Reasoning to 1T with 1/32 Activation

Ling 2.0 is a reasoning-first sparse MoE family from Ant Group that keeps per-token compute low with a 1/32 activation recipe while scaling from 16B to 1T parameters.

Sparse MoE at the core

Ling 2.0 from Ant Group's Inclusion AI team is a reasoning-first family of sparse Mixture-of-Experts (MoE) language models that preserves low per-token compute while scaling from 16B to 1T parameters. The central idea is that each activation should directly improve reasoning behavior, so the architecture intentionally keeps a very small fraction of experts active for every token while offering a vast parameter pool overall.

A fixed activation recipe across sizes

Every Ling 2.0 model uses the same sparse MoE layer: 256 routed experts plus one always-on shared expert. The router selects 8 routed experts per token and always adds the shared expert, so about 9 of 257 experts (roughly 3.5%) are active per token, with the 8-of-256 routed selection giving the nominal 1/32 activation ratio. Because only a small subset of the network is computed for each token, training and inference stay efficient, with reported gains of around 7x over equivalent dense models.
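
To make the routing pattern concrete, here is a minimal PyTorch sketch of a top-8-of-256 MoE layer with an always-on shared expert and sigmoid scoring. It is an illustration only, not Ant Group's implementation: the hidden sizes, the SiLU expert MLPs and the per-token Python loop are simplifications chosen for readability.

```python
import torch
import torch.nn as nn

class SparseMoELayer(nn.Module):
    """Toy MoE block mirroring the recipe: 256 routed experts, top-8 routing,
    plus one always-on shared expert. Hidden sizes are illustrative only."""

    def __init__(self, d_model=512, d_ff=128, n_experts=256, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.shared_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):  # x: (num_tokens, d_model)
        # Sigmoid expert scoring; no softmax and no auxiliary balancing loss in this sketch.
        scores = torch.sigmoid(self.router(x))              # (num_tokens, n_experts)
        weights, indices = scores.topk(self.top_k, dim=-1)  # 8 routed experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)

        outputs = []
        for t in range(x.size(0)):  # naive per-token loop, kept simple for readability
            routed = sum(w * self.experts[int(e)](x[t])
                         for w, e in zip(weights[t], indices[t]))
            outputs.append(self.shared_expert(x[t]) + routed)  # shared expert always runs
        return torch.stack(outputs)

layer = SparseMoELayer()
print(layer(torch.randn(4, 512)).shape)  # only 9 of 257 experts are evaluated per token
```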

Architecture chosen by Ling Scaling Laws

Rather than relying on trial and error, the team used what they call Ling Scaling Laws together with a "Ling Wind Tunnel" of small MoE runs. These controlled experiments, trained under consistent data and routing rules, were fitted to power laws to predict loss, activation balance and expert utilization at larger scales. That prediction framework led to locking in the 1/32 activation, 256 routed experts and one shared expert before committing to 1T-scale GPU runs. Routing uses sigmoid scoring with no auxiliary load-balancing loss, and the stack adds QK Norm, a multi-token-prediction (MTP) loss and partial RoPE to keep deep models stable during training.
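
The exact functional forms behind the Ling Scaling Laws are in the paper; purely as an illustration of the general fit-and-extrapolate workflow, the sketch below fits a simple power law to made-up loss/compute points from hypothetical small "wind tunnel" runs and extrapolates it to a much larger budget. In the real pipeline, separate fits would also cover activation balance and expert utilization.

```python
import numpy as np

# Hypothetical (compute, loss) points from small "wind tunnel" MoE runs; the numbers
# are made up purely to illustrate the workflow, not taken from the paper.
compute = np.array([1e19, 3e19, 1e20, 3e20, 1e21])   # training FLOPs
loss = np.array([3.57, 3.43, 3.29, 3.17, 3.05])      # evaluation loss

# Fit a simple power law, loss ~ C^slope, in log-log space.
slope, intercept = np.polyfit(np.log10(compute), np.log10(loss), deg=1)

def predicted_loss(c):
    """Extrapolate the fitted power law to a compute budget c (in FLOPs)."""
    return 10 ** (intercept + slope * np.log10(c))

# Predict the loss at a 100x larger budget before committing GPUs to it.
print(f"predicted loss at 1e23 FLOPs: {predicted_loss(1e23):.2f}")
```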

Training strategy and long-context focus

Pretraining spans more than 20T tokens. The pipeline starts at a 4K context length and progressively raises the share of reasoning-heavy sources such as math and code to nearly half the corpus. A mid-training stage then extends context to about 32K on a selected 150B-token slice, injects another roughly 600B tokens of high-quality chain-of-thought data, and finally stretches context to 128K with YaRN while preserving performance on short-context inputs. This staged approach builds long-context and reasoning capability into the model early, rather than deferring it to later fine-tuning stages.
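
The staged schedule can be summarized as a small configuration sketch. The stage names and fields below are illustrative labels for the phases described above, not Ant Group's internal configuration; the token budgets and context lengths follow the reported figures.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    context_len: int
    tokens: str   # token budget as described in the report
    notes: str

# Staged Ling 2.0 training pipeline as summarized above; labels are illustrative.
SCHEDULE = [
    Stage("pretrain", 4_096, ">20T",
          "reasoning-heavy sources (math, code) ramped to nearly half the corpus"),
    Stage("mid-train-context", 32_768, "~150B",
          "context-extension slice"),
    Stage("mid-train-cot", 32_768, "~600B",
          "high-quality chain-of-thought injection"),
    Stage("long-context", 131_072, "n/a",
          "YaRN-based extension to 128K with short-context quality preserved"),
]

for s in SCHEDULE:
    print(f"{s.name:>18}: ctx={s.context_len:>7,} tokens={s.tokens:<6} {s.notes}")
```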

Post-training alignment in multiple stages

Alignment is split into capability and preference passes. Decoupled Fine Tuning (DFT) teaches the model to switch between quick responses and deeper reasoning via different system prompts. An evolutionary chain-of-thought (Evo CoT) stage expands and diversifies reasoning traces, and an LPO-style sentence-level policy optimization with a Group Arena Reward aligns outputs to human judgments at fine granularity. This multi-step alignment helps the non-thinking base models reach strong math, code and instruction-following performance without inflating every answer.
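
As a rough sketch of what prompt-switched dual-mode behavior looks like from the outside, the snippet below assembles chat messages for a "quick" versus a "thinking" mode. The system prompts and chat format are hypothetical placeholders, not Ling 2.0's actual DFT prompts.

```python
# Illustration only: the report describes Decoupled Fine Tuning (DFT) as training one
# model to serve both quick answers and deeper reasoning, switched by system prompt.
# The prompt strings and chat format below are hypothetical, not Ling 2.0's actual ones.

SYSTEM_PROMPTS = {
    "quick": "Answer directly and concisely.",
    "thinking": "Reason step by step before giving the final answer.",
}

def build_messages(user_query: str, mode: str = "quick") -> list[dict]:
    """Assemble a chat in a generic system/user message format for the chosen mode."""
    return [
        {"role": "system", "content": SYSTEM_PROMPTS[mode]},
        {"role": "user", "content": user_query},
    ]

# At inference time, switching behavior is just a matter of changing the system prompt.
print(build_messages("What is 17 * 24?", mode="thinking"))
```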

Infrastructure optimizations

Ling 2.0 trains natively in FP8 with safeguards, keeping the loss trajectory close to BF16 while claiming about 15% better hardware utilization. Larger speedups, roughly 40%, come from heterogeneous pipeline parallelism, interleaved one-forward-one-backward (1F1B) execution and partitioning that is aware of the MTP blocks. Warmup-Stable-Merge (WSM) replaces traditional learning-rate decay by merging checkpoints to stabilize training. Together, these systems techniques make 1T-scale runs practical on existing clusters.
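
The precise Warmup-Stable-Merge schedule is described in the paper; as a minimal sketch of the underlying operation, the function below averages the parameters of several PyTorch checkpoints into one merged state dict. Which checkpoints are merged, and with what weights, are assumptions left to the caller here.

```python
import torch

def merge_checkpoints(paths, weights=None):
    """Average the parameters of several checkpoints into a single state dict.

    A minimal sketch in the spirit of Warmup-Stable-Merge: instead of decaying the
    learning rate, late stable-phase checkpoints are merged. The choice of checkpoints
    and merge weights is left to the caller and is not the paper's exact schedule.
    """
    if weights is None:
        weights = [1.0 / len(paths)] * len(paths)  # uniform average by default

    merged = None
    for path, w in zip(paths, weights):
        state = torch.load(path, map_location="cpu")
        if merged is None:
            merged = {k: w * v.float() for k, v in state.items()}
        else:
            for k, v in state.items():
                merged[k] += w * v.float()
    return merged

# Example usage (hypothetical file names):
# merged = merge_checkpoints(["ckpt_19000.pt", "ckpt_19500.pt", "ckpt_20000.pt"])
# torch.save(merged, "ckpt_merged.pt")
```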

Reported results and capacity points

The series includes three sizes (the activation fractions implied by these numbers are worked out in the short sketch after the list):

  • Ling mini 2.0: 16B total parameters, about 1.4B activated per token, reported to match 7B–8B dense model quality and to generate over 300 tokens/sec in simple QA on H20 hardware.
  • Ling flash 2.0: ~100B total parameters, about 6.1B activated per token, maintaining the same 1/32 activation recipe for higher capacity without increasing per-token compute.
  • Ling 1T: 1T total parameters with roughly 50B active per token, 128K context, and the full post-training Evo CoT plus LPO-style stack to push efficient reasoning.
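
Dividing the reported per-token active parameters by the totals gives the overall activated fraction for each size. These fractions sit above the nominal 1/32 routed-expert ratio, presumably because attention layers, embeddings and the shared expert run for every token.

```python
# Reported totals vs. per-token active parameters for the three released sizes.
models = {
    "Ling mini 2.0": (16e9, 1.4e9),
    "Ling flash 2.0": (100e9, 6.1e9),
    "Ling 1T": (1e12, 50e9),
}

for name, (total, active) in models.items():
    print(f"{name:>15}: {active / total:.1%} of parameters active per token")
```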

Evaluations show small-activation MoE models can deliver competitive quality while keeping per-token compute low. Across sizes, the combination of sparse activation, FP8 training and a shared training schedule yields predictable scaling and multi-fold efficiency gains over dense baselines.

Key takeaways

  • A consistent 1/32 activation MoE design (256 routed experts + 1 shared expert) is applied from 16B to 1T.
  • The Ling Wind Tunnel and Ling Scaling Laws guide architecture choices so the same recipe transfers across scales.
  • Pretraining and mid-training emphasize early exposure to long context and reasoning data, not just SFT-time additions.
  • Post-training alignment separates capability and preference tuning to produce targeted improvements in math, code and instruction behavior.
  • Systems-level improvements including FP8, pipeline optimizations and checkpoint merging make trillion-parameter sparse models practical on current hardware.

For more technical details and experimental results, see the Ling 2.0 paper at https://arxiv.org/abs/2510.22115.
