
NVIDIA and Mistral AI Achieve 10x Faster Inference Speed

NVIDIA and Mistral AI unveil a partnership enhancing AI efficiency with 10x faster inference on GB200 NVL72 systems.

Partnership Overview

NVIDIA announced a significant expansion of its strategic collaboration with Mistral AI. This partnership coincides with the release of the new Mistral 3 frontier open model family, marking a pivotal moment where hardware acceleration and open-source model architecture have converged to redefine performance benchmarks.

Performance Improvement: A Game Changer

This collaboration delivers a massive leap in inference speed: the new models run up to 10x faster on NVIDIA GB200 NVL72 systems than on previous-generation H200 systems, unlocking unprecedented efficiency for enterprise-grade AI and addressing the latency and cost bottlenecks that have historically plagued large-scale deployment of reasoning models.

A Generational Leap: Focus on Blackwell

As enterprise demand shifts from simple chatbots to high-reasoning, long-context agents, inference efficiency has become a critical bottleneck. The collaboration between NVIDIA and Mistral AI addresses this head-on by optimizing the Mistral 3 family specifically for the NVIDIA Blackwell architecture.

In production AI systems that must deliver both a strong user experience (UX) and cost-efficient scale, the NVIDIA GB200 NVL72 provides up to 10x higher performance than the previous-generation H200. This translates into significantly higher energy efficiency: more than 5,000,000 tokens per second per megawatt (MW) at a user interactivity rate of 40 tokens per second.
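
As a quick sanity check on what that figure implies for serving capacity, here is a back-of-envelope estimate using only the numbers stated above (illustrative, not a vendor benchmark):

    # Back-of-envelope serving capacity implied by the stated figures.
    tokens_per_sec_per_mw = 5_000_000   # stated aggregate throughput per megawatt
    tokens_per_sec_per_user = 40        # stated per-user interactivity rate

    concurrent_users_per_mw = tokens_per_sec_per_mw / tokens_per_sec_per_user
    print(f"{concurrent_users_per_mw:,.0f} concurrent users per MW")  # 125,000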

The Mistral 3 Family: Engineered for Efficiency

The driving force behind this performance leap is the newly released Mistral 3 family, delivering industry-leading accuracy, efficiency, and customization capabilities. The suite covers the spectrum from massive data center workloads to edge device inference.

Mistral Large 3: State-of-the-Art Model

At the top of this hierarchy sits Mistral Large 3, a state-of-the-art sparse, multimodal, and multilingual Mixture-of-Experts (MoE) model.

  • Total Parameters: 675 billion
  • Active Parameters: 41 billion
  • Context Window: 256K tokens

Trained on NVIDIA Hopper GPUs, Mistral Large 3 is designed to handle complex reasoning tasks, offering parity with top-tier closed models while retaining the flexibility of open weights.
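
The sparse MoE design is what keeps a model of this size economical at inference time: only a small fraction of the weights participate in any one token. A quick calculation from the parameter counts above:

    # Fraction of Mistral Large 3's weights active per token, from the counts above.
    total_params = 675e9
    active_params = 41e9

    active_fraction = active_params / total_params
    print(f"~{active_fraction:.1%} of parameters active per token")  # ~6.1%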

Ministral 3: High-Performance Edge Models

Complementing the large model is the Ministral 3 series, featuring small, dense, high-performance models designed for speed and versatility.

  • Sizes: 3B, 8B, and 14B parameters
  • Variants: Base, Instruct, and Reasoning for each size (nine models total)
  • Context Window: 256K tokens across the board

On the GPQA Diamond accuracy benchmark, the Ministral 3 series delivers higher accuracy while using up to 100x fewer tokens.

Technical Advancement: Optimization Stack

The "10x" performance claim rests on a comprehensive stack of optimizations co-developed by Mistral and NVIDIA engineers through an extreme co-design approach.

TensorRT-LLM Wide Expert Parallelism (Wide-EP)

To fully exploit the massive scale of the GB200 NVL72, NVIDIA employed Wide Expert Parallelism within TensorRT-LLM. This technology boosts performance by optimizing MoE GroupGEMM kernels, expert distribution, and load balancing. It also exploits the NVL72’s coherent memory domain and NVLink fabric.
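
The core idea of expert parallelism is to shard a model's experts across many GPUs and route each token only to the devices that hold its selected experts, which makes per-GPU load balancing critical. Below is a minimal conceptual sketch of that dispatch step; the names, counts, and routing logic are illustrative, not the TensorRT-LLM implementation:

    # Conceptual sketch of expert parallelism: experts are sharded across GPUs, and each
    # token's top-k routed experts determine which GPUs must process it.
    import numpy as np

    num_experts, num_gpus, top_k = 128, 72, 2          # wide-EP spreads experts over many GPUs
    expert_to_gpu = np.arange(num_experts) % num_gpus  # simple static expert placement

    def dispatch(router_scores: np.ndarray) -> dict[int, int]:
        """Count tokens routed to each GPU, the quantity a load balancer tries to equalize."""
        topk_experts = np.argsort(router_scores, axis=-1)[:, -top_k:]  # (tokens, top_k)
        load = {gpu: 0 for gpu in range(num_gpus)}
        for experts in topk_experts:
            for e in experts:
                load[int(expert_to_gpu[e])] += 1
        return load

    router_scores = np.random.default_rng(0).normal(size=(1024, num_experts))  # fake router output
    per_gpu_load = dispatch(router_scores)
    print("max/min tokens per GPU:", max(per_gpu_load.values()), min(per_gpu_load.values()))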

Native NVFP4 Quantization

A significant technical advancement is support for NVFP4, a quantization format native to the Blackwell architecture. For Mistral Large 3, developers can deploy an NVFP4 checkpoint quantized offline using the open-source llm-compressor library.
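
For teams producing their own checkpoints, llm-compressor's published one-shot quantization workflow follows the general shape below; the model ID, calibration dataset, and sample counts are placeholders and assumptions, not Mistral's actual recipe:

    # Minimal offline NVFP4 quantization sketch using llm-compressor.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier

    MODEL_ID = "mistralai/Mistral-Large-3"  # placeholder checkpoint name

    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

    # NVFP4 quantizes Linear layers to the 4-bit floating-point format native to
    # Blackwell; the lm_head is typically kept in higher precision.
    recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

    oneshot(
        model=model,
        dataset="open_platypus",       # assumed calibration set for activation scales
        recipe=recipe,
        max_seq_length=2048,
        num_calibration_samples=512,
    )

    model.save_pretrained("Mistral-Large-3-NVFP4", save_compressed=True)
    tokenizer.save_pretrained("Mistral-Large-3-NVFP4")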

Disaggregated Serving with NVIDIA Dynamo

Mistral Large 3 utilizes NVIDIA Dynamo, a low-latency distributed inference framework, to disaggregate the prefill and decode phases of inference, significantly boosting performance for long-context workloads.
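
Disaggregation separates the compute-bound prefill phase from the memory-bandwidth-bound decode phase so each can be scheduled and scaled independently. A conceptual sketch of the handoff (not the Dynamo API):

    # Conceptual sketch of disaggregated serving: prefill and decode run on separate
    # worker pools, with the KV cache handed off between them.
    from dataclasses import dataclass

    @dataclass
    class KVCache:
        tokens: list[int]  # stand-in for the real attention key/value tensors

    def prefill_worker(prompt_tokens: list[int]) -> KVCache:
        """Compute-bound: processes the whole prompt once, producing the KV cache."""
        return KVCache(tokens=list(prompt_tokens))

    def decode_worker(kv: KVCache, max_new_tokens: int) -> list[int]:
        """Bandwidth-bound: generates one token at a time against the cache."""
        output = []
        for step in range(max_new_tokens):
            next_token = (sum(kv.tokens) + step) % 32000  # placeholder for a model forward pass
            kv.tokens.append(next_token)
            output.append(next_token)
        return output

    # Long-context prefill can run on GPUs sized for compute, while decode scales
    # independently on GPUs sized for KV-cache capacity and bandwidth.
    kv = prefill_worker(prompt_tokens=list(range(100)))
    print(decode_worker(kv, max_new_tokens=5))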

Broad Deployment Capabilities: Cloud to Edge

Optimization efforts extend beyond data centers. The Ministral 3 series is engineered for edge deployment, offering flexibility for various needs.

RTX and Jetson Acceleration

The dense Ministral models are optimized for platforms like the NVIDIA GeForce RTX AI PC and NVIDIA Jetson robotics modules.
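
Weight precision largely determines whether a given model fits on an edge device. A rough weight-memory estimate for the three Ministral sizes at common precisions (illustrative; ignores the KV cache and runtime overhead):

    # Approximate weight memory: GB = params_in_billions * bytes_per_param.
    BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "nvfp4": 0.5}

    for billions in (3, 8, 14):
        sizes = {fmt: f"{billions * bpp:.1f} GB" for fmt, bpp in BYTES_PER_PARAM.items()}
        print(f"{billions}B:", sizes)  # e.g. 3B at nvfp4 ~ 1.5 GB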

Broad Framework Support

NVIDIA collaborates with open-source communities to make these models usable across popular inference frameworks, including llama.cpp and vLLM.
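
For example, serving one of the models through vLLM's Python API takes only a few lines; the checkpoint ID below is a placeholder for whichever Ministral 3 repository is published on Hugging Face:

    # Minimal vLLM usage sketch; the model ID is a placeholder.
    from vllm import LLM, SamplingParams

    llm = LLM(model="mistralai/Ministral-3-8B-Instruct")  # placeholder checkpoint ID
    params = SamplingParams(temperature=0.7, max_tokens=128)

    outputs = llm.generate(["Explain expert parallelism in one paragraph."], params)
    print(outputs[0].outputs[0].text)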

Production-Ready Solutions with NVIDIA NIM

To streamline enterprise adoption, the new models are accessible through NVIDIA NIM microservices, allowing deployment with minimal setup.
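
NIM microservices expose an OpenAI-compatible HTTP endpoint, so a deployed container can be queried with standard client libraries. A minimal sketch, assuming a container already running locally on port 8000 and a placeholder model identifier:

    # Query a locally deployed NIM endpoint via the OpenAI-compatible API.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

    response = client.chat.completions.create(
        model="mistralai/mistral-large-3",  # placeholder model identifier
        messages=[{"role": "user", "content": "Summarize the GB200 NVL72 in one line."}],
    )
    print(response.choices[0].message.content)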

A New Standard for Open Intelligence

The release of the NVIDIA-accelerated Mistral 3 models represents a major leap for the open-source AI community. With further optimizations such as speculative decoding and multi-token prediction expected to push performance even higher, Mistral 3 sets a new standard for open, production-grade intelligence.
