
Nemotron Nano 2: 128K-Context LLMs That Run Up to 6× Faster on a Single A10G

NVIDIA's Nemotron Nano 2 delivers hybrid Mamba-Transformer LLMs that run up to 6× faster than comparably sized models and support 128K-token context on a single A10G GPU, with most training data and recipes open-sourced.

NVIDIA's Nemotron Nano 2: speed and long-context capability for practical deployment

Nemotron Nano 2 is NVIDIA’s new family of hybrid Mamba-Transformer large language models (LLMs) designed for high-throughput reasoning and long-context tasks while remaining deployable on a single midrange GPU. The release emphasizes transparency: most pretraining and fine-tuning datasets, recipes, and model checkpoints are published to enable reproducibility and community use.

Key highlights

  • Up to 6.3× inference throughput versus similarly sized open models in reasoning-heavy scenarios, without sacrificing accuracy.
  • Strong performance on reasoning, coding, math and multilingual benchmarks, often matching or exceeding open competitors.
  • Native 128K-token context inference on a single NVIDIA A10G (22 GiB) GPU, enabled by efficient pruning and a hybrid architecture.
  • Extensive open data and weights: pretraining corpora and many post-training datasets are published with permissive licenses on Hugging Face.

Hybrid architecture: Mamba meets transformer

Nemotron Nano 2 uses a hybrid Mamba-Transformer backbone inspired by Nemotron-H. Most self-attention layers are replaced by Mamba-2 state-space layers, while roughly 8% of layers retain sparse self-attention to preserve long-range dependencies. The 9B-parameter variant is described with specifics:

  • Roughly 9B parameters across 56 layers, pruned from the 62-layer pre-trained base.
  • A hidden size of 4480, with grouped-query attention in the remaining attention layers alongside Mamba-2 layers.
  • State-space layers interleaved with sparse attention and large feed-forward networks, balancing throughput against long-sequence retention.

This mix is targeted at tasks that require long, stepwise reasoning or 'thinking traces' where traditional transformer-only designs become memory- or compute-bound.
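To make the interleaving concrete, here is a minimal sketch of how such a hybrid layer pattern could be laid out, with roughly 8% of a 56-layer stack assigned to attention. The even-spacing rule and the layer labels are illustrative assumptions, not NVIDIA's published configuration.

```python
# Illustrative sketch of a hybrid Mamba/attention layer pattern.
# The ~8% attention ratio mirrors the article; the even-spacing rule
# and layer labels are assumptions, not NVIDIA's published recipe.

def build_layer_pattern(num_layers: int = 56, attention_ratio: float = 0.08) -> list[str]:
    """Return a list of layer types: 'A' for sparse attention, 'M' for Mamba-2."""
    num_attention = max(1, round(num_layers * attention_ratio))
    # Spread the few attention layers evenly through the stack so long-range
    # information is periodically re-mixed between runs of state-space blocks.
    stride = num_layers / num_attention
    attention_positions = {int((i + 0.5) * stride) for i in range(num_attention)}
    return ["A" if i in attention_positions else "M" for i in range(num_layers)]

if __name__ == "__main__":
    pattern = build_layer_pattern()
    print("".join(pattern))
    print(f"{pattern.count('A')} attention / {pattern.count('M')} Mamba-2 layers")
```

One intuition for keeping a small, evenly spaced fraction of attention is that each attention layer acts as a periodic global-mixing step between stretches of cheap state-space computation, which is consistent with the throughput-versus-retention trade-off described above.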

Training recipe and open sourcing

NVIDIA first trained a 12B-parameter teacher model on a broad, curated corpus of about 20 trillion tokens spanning web, math, code, multilingual content, and academic and STEM domains, then distilled it into the more compact 9B variants. Major released datasets include:

  • Nemotron-CC-v2: multilingual web crawl across 15 languages with synthetic Q&A rephrasings and deduplication.
  • Nemotron-CC-Math: a 133B-token math corpus standardized to LaTeX, with a 52B-token high-quality subset.
  • Nemotron-Pretraining-Code: curated GitHub source code with decontamination and deduplication.
  • Nemotron-Pretraining-SFT: synthetic instruction-following datasets across STEM and reasoning domains.

Post-training draws on more than 80B tokens of supervised fine-tuning (SFT) data, plus RLHF, tool-calling and multilingual fine-tuning data. Most of these datasets, along with recipes and checkpoints, are released to the public to foster reproducibility.
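Because the corpora are published on Hugging Face, they can be inspected directly. The sketch below streams a few records with the `datasets` library; the repository id is an assumption, so check the actual dataset names under NVIDIA's Hugging Face organization before running.

```python
# Minimal sketch: stream a few records from one of the released pretraining
# datasets. The repo id below is an assumption -- verify the exact name
# (e.g. under the "nvidia" organization) on Hugging Face before running.
from datasets import load_dataset

dataset = load_dataset(
    "nvidia/Nemotron-CC-Math",   # hypothetical repo id, check the actual listing
    split="train",
    streaming=True,              # iterate without downloading the full corpus
)

for i, record in enumerate(dataset):
    print(record.keys())         # inspect available fields (text, metadata, ...)
    if i >= 2:
        break
```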

Compression, distillation and memory-aware design

NVIDIA uses a multi-stage compression pipeline including knowledge distillation (from 12B to 9B), Mamba pruning frameworks and memory-targeted neural architecture search. Key points:

  • Pruning reduces layer counts, FFN widths and embedding sizes while retaining critical capacity.
  • Multi-stage SFT and reinforcement techniques (DPO, GRPO, controllable reasoning budgets) refine instruction following and tool use.
  • Architecture search and cache-aware pruning ensure the model and key-value cache fit within an A10G’s 22 GiB memory at 128K context length.

These optimizations yield the practical ability to run large-context inference and maintain high token generation speeds on midrange hardware.
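A back-of-the-envelope estimate shows why replacing most attention layers matters at 128K context: the key-value cache grows with the number of attention layers. Every dimension in the sketch below (attention-layer count, KV heads, head size, precision) is an illustrative assumption rather than the published configuration.

```python
# Rough KV-cache budget at 128K context, comparing a hybrid stack that keeps
# only a few attention layers with a hypothetical all-attention stack.
# All dimensions are illustrative assumptions, not published model specs.

def kv_cache_gib(attn_layers: int, seq_len: int = 128 * 1024,
                 kv_heads: int = 8, head_dim: int = 128, dtype_bytes: int = 2) -> float:
    # 2x for keys and values, batch size 1.
    bytes_total = 2 * attn_layers * seq_len * kv_heads * head_dim * dtype_bytes
    return bytes_total / (1024 ** 3)

hybrid = kv_cache_gib(attn_layers=5)          # ~8% of a 56-layer stack
full_attention = kv_cache_gib(attn_layers=56)

print(f"hybrid (5 attention layers): {hybrid:.1f} GiB")
print(f"all-attention (56 layers):   {full_attention:.1f} GiB")
```

Under these assumptions the hybrid stack's cache stays around 2.5 GiB, leaving most of the 22 GiB budget for weights and Mamba state, whereas an all-attention stack of the same depth would exceed the budget on the cache alone.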

Benchmarks and throughput

In head-to-head evaluations, Nemotron-Nano-9B-v2 posts strong results across multiple benchmarks (MMLU, GSM8K CoT, MATH, HumanEval+, RULER-128K, multilingual math) and shows a considerable throughput advantage in reasoning-trace scenarios. Reported gains reach up to 6.3× over models such as Qwen3-8B on generation-heavy tasks. The models sustain batch-size-1 inference at a full 128K context on an A10G, a capability previously impractical for many open models.
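For hands-on experimentation, a standard transformers loading path along the following lines should be close; the repository id, dtype and chat-template usage are assumptions, so defer to the model card for the exact instructions (including any reasoning-budget controls).

```python
# Minimal sketch of loading and prompting the 9B checkpoint with transformers.
# The repo id and generation settings are assumptions -- follow the model card
# on Hugging Face for the supported chat template and reasoning controls.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/NVIDIA-Nemotron-Nano-9B-v2"   # assumed repo id, verify on HF

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,      # hybrid Mamba layers may ship custom modeling code
)

messages = [{"role": "user", "content": "Explain grouped-query attention in two sentences."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt",
                                       add_generation_prompt=True).to(model.device)
output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```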

Implications for developers and research

Nemotron Nano 2 lowers practical barriers to experimenting with long-context LLM workflows and real-world reasoning use cases. By open-sourcing extensive datasets and recipes, NVIDIA encourages reproducible research, community scrutiny and faster iteration on long-context model techniques. The combination of hybrid state-space layers and sparse attention offers a path for other groups to explore similar trade-offs between throughput, memory footprint and long-range dependency retention.

Where to find models and technical material

NVIDIA publishes technical details, the paper, datasets and model checkpoints on Hugging Face and provides tutorials, code and notebooks via GitHub to help users reproduce results and experiment with the models.
