
Nemotron-Elastic-12B — One Checkpoint That Yields 6B, 9B and 12B Models Without Extra Training

NVIDIA releases Nemotron-Elastic-12B, a single 12B checkpoint that contains nested 9B and 6B variants with no extra training, cutting token and storage costs while keeping strong reasoning performance.

Nemotron-Elastic-12B is NVIDIA AI's new approach to multi-size reasoning models: a single trained 12B-parameter checkpoint that contains nested 9B and 6B variants. Instead of training or distilling each size separately, the team trains one elastic model that can be sliced into smaller submodels on demand, cutting token cost and deployment memory.

Hybrid Mamba-2 and Transformer design

Architecturally, Nemotron Elastic builds on Nemotron-H principles, mixing Mamba-2 state space model (SSM) blocks with a small set of global attention layers. The hybrid structure preserves long-range reasoning capacity while enabling structured shrinkage of its components.
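As a rough mental model, the stack is mostly Mamba-2 blocks with attention layers sparsely interleaved. The layer count, placement rule, and names in the toy sketch below are illustrative assumptions, not the released architecture:

```python
# Toy sketch of a hybrid Mamba-2 / attention layer pattern.
# Layer counts and placement are hypothetical; the real model's layout differs.

MAMBA2 = "mamba2"        # sequence state space block
ATTENTION = "attention"  # global self-attention block

def build_hybrid_pattern(num_layers: int, attention_every: int) -> list:
    """Interleave a few global attention layers among Mamba-2 blocks."""
    return [
        ATTENTION if (i + 1) % attention_every == 0 else MAMBA2
        for i in range(num_layers)
    ]

# Example: a 12-layer toy stack with attention at every 6th position.
print(build_hybrid_pattern(num_layers=12, attention_every=6))
```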

Elasticity through learned masks and a router

Elastic behavior is implemented by masks that control width and depth. Binary masks can reduce embedding channels, Mamba heads and channels, attention heads, and FFN intermediate sizes. Depth is reduced by dropping layers according to a learned importance ranking while residual paths maintain signal flow.
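A minimal sketch of these two mechanisms, assuming a simple prefix mask over FFN channels and a precomputed per-layer importance score; the function names and shapes are illustrative, not the released code:

```python
import torch

def ffn_width_mask(full_dim: int, kept_dim: int) -> torch.Tensor:
    """Binary mask over the FFN intermediate dimension: the first kept_dim
    channels stay active, the trailing ones are zeroed for smaller budgets."""
    mask = torch.zeros(full_dim)
    mask[:kept_dim] = 1.0
    return mask

def select_layers(importance, kept_layers: int):
    """Keep the kept_layers layers with the highest importance score
    (e.g. a normalized-MSE ranking); dropped layers are bypassed via
    the residual connection."""
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:kept_layers])

# Example: shrink an FFN from 8192 to 5120 channels and keep 4 of 6 layers.
print(int(ffn_width_mask(8192, 5120).sum()))              # 5120 active channels
print(select_layers([0.9, 0.2, 0.7, 0.95, 0.1, 0.6], 4))  # kept layer indices, in order
```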

A router module predicts discrete configuration choices per deployment budget. Those choices are sampled with Gumbel Softmax and turned into masks applied to embeddings, Mamba projections, attention projections, and FFN matrices. Additional design details ensure validity of the SSM structure: group-aware SSM elastification consistent with Mamba head grouping, heterogeneous MLP elastification allowing layer-specific intermediate sizes, and normalized MSE based layer importance for depth decisions. Because smaller variants are prefix selections in ranked component lists, the 6B and 9B models are true nested subnetworks of the 12B parent.
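The router and Gumbel-Softmax sampling can be sketched roughly as follows. The module name, candidate widths, and single-dimension focus are simplifying assumptions; the real router covers embeddings, Mamba, attention, and FFN dimensions jointly:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch (hypothetical module, not NVIDIA's code) of a router that
# picks a discrete width option per budget via Gumbel-Softmax and converts
# the choice into a prefix mask. Because every option keeps a prefix of the
# ranked components, smaller budgets are nested inside larger ones.

class WidthRouter(nn.Module):
    def __init__(self, num_budgets: int, options: list, full_dim: int):
        super().__init__()
        self.options = options                 # candidate widths for one dimension
        self.full_dim = full_dim
        # one logit vector per deployment budget (e.g. 6B / 9B / 12B)
        self.logits = nn.Parameter(torch.zeros(num_budgets, len(options)))

    def forward(self, budget_idx: int, tau: float = 1.0) -> torch.Tensor:
        # differentiable one-hot choice over the width options
        one_hot = F.gumbel_softmax(self.logits[budget_idx], tau=tau, hard=True)
        kept = int(sum(o * w for o, w in zip(self.options, one_hot.tolist())))
        mask = torch.zeros(self.full_dim)
        mask[:kept] = 1.0                      # prefix selection => nested subnetwork
        return mask

router = WidthRouter(num_budgets=3, options=[4096, 6144, 8192], full_dim=8192)
mask = router(budget_idx=0)                    # mask sampled for budget index 0
print(int(mask.sum()))
```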

Two-stage training targeted at reasoning

Training uses a frozen teacher model, Nemotron-Nano-V2-12B, and distills knowledge into the elastic student jointly across budgets. The optimization mixes knowledge distillation with a standard language modeling loss and proceeds in two stages (a sketch of the per-step objective follows the list):

  • Stage 1: short context, sequence length 8192, batch size 1536, around 65B tokens, uniform sampling across 6B/9B/12B budgets.
  • Stage 2: extended context, sequence length 49152, batch size 512, around 45B tokens, non-uniform sampling that emphasizes the full 12B budget (weights 0.5 for 12B, 0.3 for 9B, 0.2 for 6B).
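As referenced above, one training step under these settings might look like the sketch below. The loss weighting, temperature, and the student's `budget` argument are assumptions; only the frozen teacher, the KD-plus-LM mixture, and the budget sampling weights come from the description above:

```python
import random
import torch
import torch.nn.functional as F

# Minimal sketch of jointly distilling all budgets from a frozen teacher.
# Per step, a budget is sampled (uniform in Stage 1, weighted toward 12B in
# Stage 2) and the loss mixes KL distillation against the teacher with the
# standard language modeling loss. Not the released training recipe.

STAGE2_WEIGHTS = {"12B": 0.5, "9B": 0.3, "6B": 0.2}

def sample_budget(weights: dict) -> str:
    budgets, probs = zip(*weights.items())
    return random.choices(budgets, weights=probs, k=1)[0]

def distill_step(student, teacher, batch, budget: str,
                 kd_weight: float = 0.5, temperature: float = 1.0):
    """One hypothetical optimization step for a sampled budget."""
    with torch.no_grad():
        teacher_logits = teacher(batch["input_ids"])             # frozen teacher
    student_logits = student(batch["input_ids"], budget=budget)  # elastic student

    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    lm = F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)),
        batch["labels"].reshape(-1),
    )
    return kd_weight * kd + (1.0 - kd_weight) * lm

# Example Stage-2 step (student/teacher stand in for real models):
# budget = sample_budget(STAGE2_WEIGHTS)
# loss = distill_step(student, teacher, batch, budget)
```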

The extended context stage matters for reasoning: on AIME 2025, the 6B variant improved from 56.88 to 68.13 (about 19.8 percent relative gain), the 9B gained 9.7 percent, and the 12B gained 4.0 percent after Stage 2.

Benchmarks and performance

Nemotron Elastic was evaluated on reasoning-heavy benchmarks including MATH 500, AIME 2024 and 2025, GPQA, LiveCodeBench v5, and MMLU Pro. Key average pass@1 scores reported are roughly 70.61 for the 6B variant, 75.95 for 9B, and 77.41 for 12B. The 12B elastic model matches the NanoV2-12B baseline on average (77.41 vs 77.38), while the 9B elastic model closely tracks the NanoV2-9B baseline (75.95 vs 75.99). The 6B elastic model performs strongly for its size despite not being trained individually.

Token and memory savings

One of Nemotron Elastic's main aims is cost reduction. Producing 6B and 9B variants from a 12B parent requires only about 110B tokens in a single elastic distillation run. By comparison, NanoV2 pretraining for 6B and 9B totals about 40T tokens, and a NanoV2 compression baseline with Minitron SSM used about 750B tokens. The team reports roughly a 360x reduction versus training extra models from scratch and about 7x fewer tokens compared with the compression baseline.
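The reported ratios check out with back-of-the-envelope arithmetic:

```python
elastic_tokens = 110e9    # one elastic distillation run covering all three sizes
scratch_tokens = 40e12    # NanoV2 pretraining for separate 6B and 9B models
minitron_tokens = 750e9   # NanoV2 compression baseline with Minitron SSM

print(round(scratch_tokens / elastic_tokens))   # ~364, reported as roughly 360x
print(round(minitron_tokens / elastic_tokens))  # ~7, matching the ~7x claim
```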

Deployment memory is reduced as well: because the 6B and 9B variants are nested inside the parent, serving all three sizes requires storing only the 12B checkpoint, 24GB of BF16 weights, whereas storing NanoV2 9B plus 12B needs 42GB. That is a roughly 43 percent memory reduction while also exposing an extra 6B option.
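The memory figures follow directly from BF16 using 2 bytes per parameter:

```python
BYTES_PER_PARAM = 2  # BF16

elastic_gb = 12e9 * BYTES_PER_PARAM / 1e9             # 24 GB covers 6B, 9B, and 12B
baseline_gb = (9e9 + 12e9) * BYTES_PER_PARAM / 1e9    # 42 GB for NanoV2 9B + 12B

print(elastic_gb, baseline_gb)                        # 24.0 42.0
print(round(1 - elastic_gb / baseline_gb, 2))         # 0.43 -> ~43 percent saved
```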

Practical implications

Nemotron-Elastic-12B reframes building a family of multi-size reasoning models as a single elastic-model training problem. With a hybrid Mamba-2 and Transformer architecture, a learned routing module, and structured masks that preserve reasoning performance, one checkpoint produces multiple competitive sizes. This simplifies fleet management, lowers token and storage costs, and lets teams deploy the same trained checkpoint to devices with different compute and memory budgets.
