
NVIDIA AI Unveils Nemotron 3 for Agentic AI

Explore NVIDIA's groundbreaking release of the Nemotron 3 family, designed for long context reasoning in agentic AI.

Overview

NVIDIA has released the Nemotron 3 family of open models as part of a full stack for agentic AI, including model weights, datasets, and reinforcement learning tools. The family comprises three sizes: Nano, Super, and Ultra, targeting multi-agent systems that require long context reasoning with tight control over inference cost.

Model Details

  • Nemotron 3 Nano: approximately 30 billion total parameters, about 3 billion active per token.
  • Nemotron 3 Super: approximately 100 billion total parameters, up to 10 billion active per token.
  • Nemotron 3 Ultra: approximately 500 billion total parameters, up to 50 billion active per token.

Target Workloads

The Nemotron 3 series is designed as a set of efficient open models for agentic applications.

Nano Model

Nemotron 3 Nano is a Mixture of Experts hybrid Mamba Transformer model with about 31.6 billion total parameters, of which only about 3.2 billion are activated per forward pass, giving high representational capacity at minimal compute per token.
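
To make the sparse activation concrete, here is a minimal sketch of top-k expert routing in PyTorch. The layer width, expert count, and top-k value are illustrative assumptions, not Nemotron 3 Nano's actual configuration; the point is simply that each token exercises only a small slice of the layer's parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Illustrative top-k Mixture-of-Experts layer: every token is routed to a
    small subset of experts, so only a fraction of the layer's parameters
    contribute to each forward pass."""

    def __init__(self, d_model=512, d_ff=2048, num_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                  # x: (tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)
        weights, idx = gate.topk(self.top_k, dim=-1)       # pick top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                      # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

layer = SparseMoELayer()
tokens = torch.randn(8, 512)
print(layer(tokens).shape)  # torch.Size([8, 512]); only 2 of 16 experts run per token
```

With 16 experts and top-2 routing, each token touches roughly an eighth of the expert parameters; the same mechanism, at far larger scale, is what lets Nano hold about 31.6 billion parameters while activating only about 3.2 billion per forward pass.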

Super and Ultra Models

  • Super: Targeting high accuracy for large multi-agent applications.
  • Ultra: Aimed at complex research and planning workflows.

Performance Highlights

NVIDIA Nemotron 3 Nano delivers approximately four times the token throughput of Nemotron 2 Nano while significantly reducing reasoning token usage. It supports a native context length of up to 1 million tokens, suited to multi-agent systems that manage lengthy documents and large code bases.

Hybrid Mamba Transformer MoE Architecture

The core design features a Mixture of Experts hybrid Mamba Transformer architecture. By interleaving Mamba 2 blocks, attention blocks, and sparse expert blocks within a single stack, this architecture optimizes reasoning efficiency.

Components Explained

  • Long-Range Modeling: Mamba 2 blocks carry sequence context through constant-size state updates, keeping long-range modeling efficient.
  • Sparse Expert Utilization: MoE layers scale parameter count without a proportional increase in compute, since each token activates only a small subset of experts.
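
The exact block pattern and dimensions of Nemotron 3 are not spelled out here, so the following PyTorch sketch only illustrates the general idea of interleaving state-space, attention, and expert blocks in a single residual stack. The SSMBlock is a toy stand-in for a real Mamba 2 kernel, and the ExpertBlock is reduced to a dense feed-forward layer for brevity.

```python
import torch
import torch.nn as nn

class SSMBlock(nn.Module):
    """Stand-in for a Mamba-2 style block: a simple gated linear-time
    recurrence over the sequence (illustrative, not the real kernel)."""
    def __init__(self, d_model):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                           # x: (batch, seq, d_model)
        h, gate = self.in_proj(x).chunk(2, dim=-1)
        state, outs = torch.zeros_like(h[:, 0]), []
        for t in range(x.size(1)):                  # constant-size state update per step
            state = 0.9 * state + h[:, t]
            outs.append(state * torch.sigmoid(gate[:, t]))
        return x + self.out_proj(torch.stack(outs, dim=1))

class AttentionBlock(nn.Module):
    def __init__(self, d_model, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        out, _ = self.attn(x, x, x, need_weights=False)
        return x + out

class ExpertBlock(nn.Module):
    """Placeholder for a sparse Mixture-of-Experts block (see the routing
    sketch earlier); reduced to a dense FFN to keep the example short."""
    def __init__(self, d_model):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        return x + self.ff(x)

class HybridStack(nn.Module):
    """Interleaves state-space, attention, and expert blocks in one stack."""
    def __init__(self, d_model=128, pattern=("ssm", "ssm", "attn", "moe") * 3):
        super().__init__()
        kinds = {"ssm": SSMBlock, "attn": AttentionBlock, "moe": ExpertBlock}
        self.blocks = nn.ModuleList(kinds[k](d_model) for k in pattern)

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x

model = HybridStack()
print(model(torch.randn(2, 64, 128)).shape)  # torch.Size([2, 64, 128])
```

The interleaving ratio here (two state-space blocks per attention and expert block) is purely for illustration; the practical payoff of such layouts is that most of the stack runs in linear time over the sequence, with attention and expert capacity applied more sparingly.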

For the Super and Ultra models, NVIDIA introduces LatentMoE, which projects token representations into a lower-dimensional latent space before expert computation, improving the performance and efficiency of token prediction.
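
NVIDIA has not published LatentMoE's exact formulation in this announcement, so the sketch below only captures the stated idea: compress token representations into a smaller latent space, run expert routing and computation there (where any cross-device exchange is correspondingly cheaper), and project back to model width. All dimensions and the top-1 routing are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LatentMoESketch(nn.Module):
    """Illustrative latent-space MoE: tokens are projected into a smaller
    latent dimension, experts operate in that cheaper space, and the result
    is projected back to full model width."""
    def __init__(self, d_model=1024, d_latent=256, num_experts=8):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent)     # compress before routing
        self.up = nn.Linear(d_latent, d_model)       # expand after experts
        self.router = nn.Linear(d_latent, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_latent, 4 * d_latent), nn.GELU(),
                          nn.Linear(4 * d_latent, d_latent))
            for _ in range(num_experts)
        )

    def forward(self, x):                             # x: (tokens, d_model)
        z = self.down(x)                              # compressed activations: cheaper to
                                                      # route and to exchange across devices
        expert_id = self.router(z).argmax(dim=-1)     # top-1 routing for brevity
        out = torch.zeros_like(z)
        for e, expert in enumerate(self.experts):
            mask = expert_id == e
            if mask.any():
                out[mask] = expert(z[mask])
        return x + self.up(out)                       # residual back at full width

layer = LatentMoESketch()
print(layer(torch.randn(16, 1024)).shape)  # torch.Size([16, 1024])
```

Because the activations exchanged between experts live in the smaller latent dimension, the communication cost per token drops, which is consistent with the Key Takeaways note that LatentMoE reduces communication costs while allowing more specialized experts.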

Training Data and Precision

The training regime for Nemotron 3 draws on an extensive dataset of approximately 25 trillion tokens, with greater data diversity than previous Nemotron generations. Super and Ultra are trained using a 4-bit floating point format (NVFP4), which improves throughput and reduces memory pressure while retaining accuracy.
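
NVFP4 is a hardware number format with its own block layout and scale encoding; the toy sketch below only illustrates the general principle of blockwise 4-bit float quantization, snapping each value in a block to the nearest representable E2M1 magnitude under a shared per-block scale. The block size and scale handling here are simplifications, not the NVFP4 specification.

```python
import torch

# Representable magnitudes of a 4-bit E2M1 float (sign + 2 exponent + 1 mantissa bits).
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_blockwise(x, block=16):
    """Toy blockwise 4-bit float quantization: each block of values shares a
    scale so its largest magnitude lands at the top of the FP4 grid, then every
    value snaps to the nearest representable FP4 number."""
    flat = x.reshape(-1, block)
    scale = flat.abs().amax(dim=1, keepdim=True) / FP4_GRID.max()   # per-block scale
    scale = torch.where(scale == 0, torch.ones_like(scale), scale)
    scaled = flat / scale
    # snap each scaled magnitude to the nearest grid point, keep the sign
    idx = (scaled.abs().unsqueeze(-1) - FP4_GRID).abs().argmin(dim=-1)
    dequant = FP4_GRID[idx] * scaled.sign() * scale
    return dequant.reshape(x.shape)

w = torch.randn(4, 32)
w4 = quantize_fp4_blockwise(w)
print((w - w4).abs().max())  # small quantization error at 4 bits per value instead of 16/32
```

Storing weights and activations at 4 bits per value is what drives the throughput and memory benefits the article describes, since the same hardware moves and multiplies far more values per cycle than at 16-bit precision.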

Key Takeaways

  • Open Model Family: Covers Nano, Super, and Ultra with parameters ranging from 30 billion to 500 billion.
  • Hybrid Architecture: Supports 1 million token context with a sparse Mixture of Experts approach.
  • Latent MoE: Reduces communication costs while allowing for more specialized experts.
  • Efficiency Focus: Trained on a massive dataset using NVFP4 precision to enhance throughput and minimize memory load.