MLPerf Inference v5.1 (2025): What the New Results Mean for GPUs, CPUs and AI Accelerators
What MLPerf Inference actually measures
MLPerf Inference evaluates complete systems (hardware, runtime, and serving stack) running fixed, pre-trained models under strict latency and accuracy constraints. Results are produced for Datacenter and Edge suites using LoadGen request patterns to preserve architectural neutrality and reproducibility. The Closed division locks the model and preprocessing for direct apples-to-apples comparisons; the Open division permits model changes, which makes cross-submission comparisons less direct. Availability tags (Available, Preview, and RDI, i.e. research, development, or internal) clarify whether a configuration is shipping or experimental.
What changed in v5.1 (2025)
The v5.1 update (published Sept 9, 2025) adds three modern workloads and expands interactive serving coverage:
- DeepSeek-R1: the first reasoning benchmark focused on non-trivial control flow and exact-match quality.
- Llama-3.1-8B: a summarization workload that replaces GPT-J in the Closed set.
- Whisper Large V3: a new speech-to-text (ASR) workload that replaces the previous RNN-T benchmark.
This round included 27 submitters and the first appearance of new silicon and SKUs such as AMD Instinct MI355X, Intel Arc Pro B60 48GB Turbo, NVIDIA GB300, RTX 4000 Ada-PCIe-20GB, and RTX Pro 6000 Blackwell Server Edition. Interactive scenarios were broadened beyond a single model to better reflect agent and chat workloads with tight TTFT/TPOT limits.
Serving scenarios and how they map to real workloads
MLPerf defines four scenarios you should map to production SLAs:
- Offline: maximize throughput without a latency bound; batching and scheduler strategies dominate.
- Server: Poisson arrival pattern with p99 latency bounds; closest match to chat and agent backends.
- Single-Stream / Multi-Stream (Edge): Single-Stream measures tail latency with one query in flight at a time; Multi-Stream issues fixed-size batches of samples per query and measures their tail latency, stressing concurrency.
Each scenario reports a defined metric: maximum Poisson throughput subject to the latency bound for Server, total throughput for Offline, and per-stream tail latency for the Edge scenarios.
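To make the Server definition concrete, here is a small, purely illustrative Python sketch (not the actual LoadGen harness): it generates Poisson arrivals against a toy single-worker queue and sweeps the offered load to find the highest rate that still meets a p99 latency bound, which is essentially what the Server metric reports.

```python
import random

def p99(values):
    """99th-percentile helper (nearest-rank on a sorted copy)."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(0.99 * len(ordered)))]

def simulate_server(target_qps, service_time_s, bound_s, n_queries=20_000, seed=0):
    """Toy single-worker queue: Poisson arrivals at `target_qps`, fixed
    per-query service time. Returns (p99 latency, whether the bound holds)."""
    rng = random.Random(seed)
    clock = 0.0          # arrival clock
    busy_until = 0.0     # when the worker frees up
    latencies = []
    for _ in range(n_queries):
        clock += rng.expovariate(target_qps)   # Poisson inter-arrival time
        start = max(clock, busy_until)         # wait if the worker is busy
        busy_until = start + service_time_s
        latencies.append(busy_until - clock)   # queueing + service latency
    tail = p99(latencies)
    return tail, tail <= bound_s

# Sweep offered load to find the highest QPS that still meets a 100 ms p99 bound.
for qps in (40, 60, 80, 95, 99):
    tail, ok = simulate_server(target_qps=qps, service_time_s=0.01, bound_s=0.1)
    print(f"{qps:>3} QPS -> p99 {tail * 1000:6.1f} ms  {'PASS' if ok else 'FAIL'}")
```

The same shape explains why Server results are so sensitive to tail behavior: as offered load approaches capacity, queueing pushes p99 latency past the bound long before average latency looks bad.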
Latency metrics for LLMs: TTFT and TPOT
LLM workloads include TTFT (time-to-first-token) and TPOT (time-per-output-token) as first-class metrics. v5.0 already tightened the interactive limits for Llama-2-70B (p99 TTFT 450 ms, TPOT 40 ms) to better reflect user-perceived responsiveness, while larger long-context models like Llama-3.1-405B get higher bounds (p99 TTFT 6 s, TPOT 175 ms) because of model size and context handling. Those metrics carry into v5.1, and each new LLM and reasoning task has its own TTFT/TPOT limits, listed in the next section.
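A minimal sketch of how these two metrics are typically derived from per-token timestamps; the StreamedResponse record and its field names are assumptions for illustration, not the MLPerf harness or its log format.

```python
from dataclasses import dataclass

@dataclass
class StreamedResponse:
    """Hypothetical record of one streamed LLM response (illustrative only)."""
    request_time: float        # when the query was issued (seconds)
    token_times: list[float]   # wall-clock arrival time of each output token

def ttft(resp: StreamedResponse) -> float:
    """Time-to-first-token: first token arrival minus request time."""
    return resp.token_times[0] - resp.request_time

def tpot(resp: StreamedResponse) -> float:
    """Time-per-output-token: mean gap between tokens after the first
    (assumes at least two output tokens)."""
    deltas = [b - a for a, b in zip(resp.token_times, resp.token_times[1:])]
    return sum(deltas) / len(deltas)

def p99(values: list[float]) -> float:
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(0.99 * len(ordered)))]

def meets_interactive_sla(responses, ttft_limit_s=0.450, tpot_limit_s=0.040) -> bool:
    """Check p99 TTFT/TPOT against the Llama-2-70B interactive limits
    quoted above (450 ms / 40 ms)."""
    return (p99([ttft(r) for r in responses]) <= ttft_limit_s and
            p99([tpot(r) for r in responses]) <= tpot_limit_s)
```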
Datacenter targets and key workloads
Closed-division, datacenter-focused targets in v5.1 include (latency limits are p99 TTFT / TPOT):
- LLM Q&A: Llama-2-70B (OpenOrca) — Conversational: 2000 ms/200 ms; Interactive: 450 ms/40 ms; quality gates at 99% and 99.9%.
- LLM Summarization: Llama-3.1-8B (CNN/DailyMail) — Conversational: 2000 ms/100 ms; Interactive: 500 ms/30 ms.
- Reasoning: DeepSeek-R1 — TTFT 2000 ms / TPOT 80 ms; 99% of FP16 exact-match baseline.
- ASR: Whisper Large V3 (LibriSpeech) — WER-based quality requirements.
- Long-context LLM: Llama-3.1-405B — TTFT 6000 ms, TPOT 175 ms.
Legacy CV, NLP, and recommendation entries (ResNet-50, RetinaNet, BERT-L, DLRM, 3D-UNet) remain to preserve continuity across cycles.
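For quick reference, the TTFT/TPOT limits quoted above can be collected into a small lookup table; the keys below are informal labels for this article, not official MLPerf benchmark identifiers.

```python
# v5.1 serving limits quoted in this section (p99, milliseconds).
LATENCY_TARGETS_MS = {
    "llama2-70b":              {"ttft": 2000, "tpot": 200},
    "llama2-70b-interactive":  {"ttft": 450,  "tpot": 40},
    "llama3.1-8b":             {"ttft": 2000, "tpot": 100},
    "llama3.1-8b-interactive": {"ttft": 500,  "tpot": 30},
    "deepseek-r1":             {"ttft": 2000, "tpot": 80},
    "llama3.1-405b":           {"ttft": 6000, "tpot": 175},
}

def within_budget(workload: str, ttft_p99_ms: float, tpot_p99_ms: float) -> bool:
    """Return True if measured p99 TTFT/TPOT fit the workload's quoted limits."""
    limit = LATENCY_TARGETS_MS[workload]
    return ttft_p99_ms <= limit["ttft"] and tpot_p99_ms <= limit["tpot"]

# Example: a deployment measuring 380 ms TTFT / 35 ms TPOT at p99 clears the
# Llama-2-70B interactive bar but not the Llama-3.1-8B interactive one (30 ms TPOT).
print(within_budget("llama2-70b-interactive", 380, 35))    # True
print(within_budget("llama3.1-8b-interactive", 380, 35))   # False
```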
Power reporting and energy claims
MLPerf Power is optional but reports system wall-plug energy for runs (Server/Offline: system power; Single/Multi-Stream: energy per stream). Only measured power runs are valid for energy-efficiency comparisons; vendor TDPs or estimates are out of scope. v5.1 includes datacenter and edge power submissions, and wider participation would improve the dataset.
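The arithmetic behind an energy-efficiency claim is simple once a measured wall-plug power figure exists; the sketch below uses illustrative numbers, not values from any submission.

```python
def energy_efficiency(throughput_per_s: float, avg_system_power_w: float) -> float:
    """Work per joule from a *measured* wall-plug power run, e.g. tokens/s
    divided by watts gives tokens per joule. Vendor TDPs or estimates are
    not a valid substitute for a measured power run."""
    return throughput_per_s / avg_system_power_w

# Illustrative numbers only: a system sustaining 12,000 tokens/s at 6.5 kW
# measured at the wall delivers roughly 1.85 tokens per joule.
print(round(energy_efficiency(12_000, 6_500), 2))
```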
How to read the result tables without fooling yourself
A few practical rules:
- Compare Closed vs Closed only. Open runs may use different models or quantization and are not directly comparable.
- Match accuracy targets: stricter quality (99.9% vs 99%) usually lowers throughput.
- Treat MLPerf numbers as system-level throughput under constraints. Per-chip numbers derived by dividing by accelerator count are not defined by MLPerf and should be used only for budget sanity checks, not marketing claims.
- Prefer entries tagged Available and include Power columns when efficiency matters.
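A sketch of how those rules translate into a results filter. The CSV column names (Division, Availability, Model, Scenario, Accuracy, Result, Accelerators, System) are assumptions about an exported results table, not an official MLCommons schema.

```python
import csv

def comparable_rows(path, model, scenario, accuracy="99%"):
    """Yield rows that are safe to compare head-to-head: same Closed division,
    Available category, model, scenario, and accuracy target."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if (row.get("Division") == "Closed"
                    and row.get("Availability") == "Available"
                    and row.get("Model") == model
                    and row.get("Scenario") == scenario
                    and row.get("Accuracy") == accuracy):
                yield row

# Example: rank comparable rows by throughput, with a rough per-accelerator
# figure used only as a budget sanity check (see the caveat above).
rows = list(comparable_rows("results.csv", "llama2-70b-interactive", "Server"))
for row in sorted(rows, key=lambda r: float(r["Result"]), reverse=True):
    per_chip = float(row["Result"]) / max(int(row.get("Accelerators") or 1), 1)
    print(row["System"], row["Result"], f"~{per_chip:.0f}/accelerator")
```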
Interpreting the 2025 results across architectures
GPUs: New silicon shows up strongly in Server-Interactive and long-context workloads, where scheduler efficiency, KV-cache handling, and memory management matter alongside raw FLOPs. Rack-scale systems (e.g., GB300 NVL72 class) deliver the highest aggregate throughput; normalize by accelerator and host counts when comparing them to single-node entries.
CPUs: CPU-only entries continue to serve as baselines and highlight host-side preprocessing and dispatch overheads that bottleneck accelerators in Server mode. New Xeon 6 submissions and mixed CPU+GPU stacks appear in v5.1; always check host generation and memory configuration.
Alternative accelerators: v5.1 increases architectural diversity. For Open-division submissions (pruned or low-precision variants), validate that cross-system comparisons hold constant division, model, dataset, scenario, and accuracy.
A practical playbook: map benchmarks to SLAs
- Interactive chat/agents → use Server-Interactive on Llama-2-70B, Llama-3.1-8B or DeepSeek-R1; validate p99 TTFT/TPOT and quality.
- Batch summarization/ETL → use Offline on Llama-3.1-8B; rack-level throughput drives cost.
- ASR front-ends → use Whisper V3 Server with tail-latency bound; pay attention to memory bandwidth and audio pre/post-processing.
- Long-context analytics → use Llama-3.1-405B and verify whether UX tolerates 6 s TTFT and 175 ms TPOT.
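The same mapping expressed as a small lookup, handy when filtering the result pages; the labels are informal and the entries simply restate the bullets above.

```python
# Informal mapping of production use cases to the benchmark and scenario to
# study first; labels are not official MLPerf benchmark identifiers.
PLAYBOOK = {
    "interactive_chat_agents": {
        "benchmarks": ["llama2-70b-interactive", "llama3.1-8b-interactive", "deepseek-r1"],
        "scenario": "Server (interactive)",
        "watch": "p99 TTFT/TPOT and accuracy gate",
    },
    "batch_summarization_etl": {
        "benchmarks": ["llama3.1-8b"],
        "scenario": "Offline",
        "watch": "aggregate throughput and cost per rack",
    },
    "asr_frontend": {
        "benchmarks": ["whisper-large-v3"],
        "scenario": "Server",
        "watch": "tail latency, memory bandwidth, audio pre/post-processing",
    },
    "long_context_analytics": {
        "benchmarks": ["llama3.1-405b"],
        "scenario": "Server",
        "watch": "whether UX tolerates 6 s TTFT / 175 ms TPOT",
    },
}
```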
What the 2025 cycle signals for procurement
Interactive LLM serving is now table stakes: tight TTFT/TPOT limits expose scheduling, batching, paged-attention, and KV-cache differences that can reorder the leaderboard relative to Offline-only comparisons. Reasoning workloads like DeepSeek-R1 stress control flow and memory traffic differently than plain next-token generation. Broader modality coverage (Whisper V3 and SDXL) surfaces I/O and bandwidth constraints beyond token decoding. Procurement teams should filter results by the workloads that mirror their production SLAs and validate claims against the MLCommons result pages and power methodology.
References
- https://mlcommons.org/2025/09/mlperf-inference-v5-1-results/
- https://mlcommons.org/benchmarks/inference-datacenter/
- https://mlcommons.org/benchmarks/inference-edge/
- https://docs.mlcommons.org/inference/
- https://docs.mlcommons.org/inference/power/
- https://mlcommons.org/2024/03/mlperf-llama2-70b/
- https://mlcommons.org/2025/09/deepseek-inference-5-1/
- https://blogs.nvidia.com/blog/mlperf-inference-blackwell-ultra/
- https://developer.nvidia.com/blog/nvidia-blackwell-ultra-sets-new-inference-records-in-mlperf-debut/
- https://rocm.blogs.amd.com/artificial-intelligence/mlperf-inference-v5.1/README.html
- https://rocm.blogs.amd.com/artificial-intelligence/mlperf-inference5.1-repro/README.html
- https://newsroom.intel.com/artificial-intelligence/intel-arc-pro-b-series-gpus-and-xeon-6-shine-in-mlperf-inference-v5-1
- https://www.globenewswire.com/news-release/2025/09/09/3147136/0/en/MLCommons-Releases-New-MLPerf-Inference-v5-1-Benchmark-Results.html
- https://www.tomshardware.com/pc-components/gpus/nvidia-claims-software-and-hardware-upgrades-allow-blackwell-ultra-gb300-to-dominate-mlperf-benchmarks-touts-45-percent-deepseek-r-1-inference-throughput-increase-over-gb200
- https://newsroom.intel.com/tag/intel-arc-pro-b60