Qwen3-Next-80B-A3B FP8 Release Makes 80B/3B Hybrid-MoE Practical on Commodity GPUs
Overview
Alibaba’s Qwen team published FP8-quantized checkpoints for Qwen3-Next-80B-A3B in two post-training variants: Instruct and Thinking. The FP8 builds mirror the BF16 releases but ship “fine-grained FP8” weights (block size 128) along with deployment guidance for current sglang and vLLM nightly builds. Benchmark numbers remain those reported for the BF16 models; the FP8 checkpoints are offered for deployment convenience and efficiency rather than as a separately benchmarked run.
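As a rough illustration of what “fine-grained FP8 with block size 128” means, the sketch below quantizes weights in contiguous blocks of 128 values with one scale per block, using PyTorch’s e4m3 dtype. The 1D block layout and FP32 scales are assumptions for illustration; the released checkpoints and the fused kernels in sglang/vLLM may tile and store scales differently.

```python
# Minimal sketch of block-wise FP8 (e4m3) quantization with block size 128.
# Assumption: one scale per contiguous block of 128 weights; real checkpoints
# may use a different tiling (e.g., 128x128 weight tiles) and fused kernels.
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3

def quantize_blockwise(w: torch.Tensor, block: int = 128):
    """Quantize a flat weight tensor in contiguous blocks of `block` values."""
    assert w.numel() % block == 0
    blocks = w.float().reshape(-1, block)
    scale = blocks.abs().amax(dim=1, keepdim=True) / FP8_MAX   # one scale per block
    q = (blocks / scale).to(torch.float8_e4m3fn)               # FP8 payload
    return q, scale.squeeze(1)

def dequantize_blockwise(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return (q.float() * scale.unsqueeze(1)).reshape(-1)

w = torch.randn(1024)
q, s = quantize_blockwise(w)
print((dequantize_blockwise(q, s) - w).abs().max())  # small per-block rounding error
```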
Architecture and A3B stack
Qwen3-Next-80B-A3B is a hybrid design combining Gated DeltaNet (a linear/conv-style attention surrogate) with Gated Attention, interleaved with an ultra-sparse Mixture-of-Experts (MoE). The architecture spans 48 layers organized into 12 repeating blocks, each consisting of three (Gated DeltaNet → MoE) layers followed by one (Gated Attention → MoE) layer. Of the model’s 80B total parameters, only roughly 3B are active per token: each MoE layer has 512 experts, of which 10 routed experts plus 1 shared expert fire for a given token. Native context is 262,144 tokens and was validated up to about 1,010,000 tokens using RoPE scaling (YaRN).
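The routing pattern described above can be sketched in a few lines. The code below is a simplified, non-optimized illustration of top-10 routing over 512 experts plus one always-on shared expert; the expert sizes, router formulation, and weight normalization are assumptions for illustration, not Qwen’s implementation.

```python
import torch
import torch.nn as nn

class SparseMoESketch(nn.Module):
    """Toy top-k MoE with a shared expert; dimensions are illustrative only."""
    def __init__(self, hidden: int, n_experts: int = 512, top_k: int = 10, ffn: int = 32):
        super().__init__()
        self.router = nn.Linear(hidden, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, ffn), nn.SiLU(), nn.Linear(ffn, hidden))
            for _ in range(n_experts)
        )
        self.shared = nn.Sequential(nn.Linear(hidden, ffn), nn.SiLU(), nn.Linear(ffn, hidden))
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:        # x: [tokens, hidden]
        probs = self.router(x).softmax(dim=-1)                  # [tokens, n_experts]
        weights, idx = probs.topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize over the top-k
        routed = torch.zeros_like(x)
        for t in range(x.size(0)):                              # naive per-token dispatch, clarity over speed
            for w, e in zip(weights[t].tolist(), idx[t].tolist()):
                routed[t] = routed[t] + w * self.experts[e](x[t])
        return self.shared(x) + routed                          # shared expert always contributes

moe = SparseMoESketch(hidden=64)        # 512 experts, 10 routed + 1 shared active per token
print(moe(torch.randn(4, 64)).shape)    # torch.Size([4, 64])
```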
Key model specs include a hidden size of 2048; gated attention with 16 query heads and 2 KV heads at head dim 256; and DeltaNet using 32 value heads and 16 query/key linear heads at head dim 128. The team reports that the 80B-A3B base model outperforms Qwen3-32B on downstream tasks at around 10% of Qwen3-32B’s training cost, and delivers roughly 10× its inference throughput at context lengths beyond 32K, driven by the low MoE activation ratio and multi-token prediction (MTP).
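On the attention side, the head layout amounts to grouped-query attention in which 16 query heads share 2 KV heads. The shape-level sketch below shows only that head sharing; gating, RoPE, and the DeltaNet path are omitted.

```python
# Shape-level sketch of the gated-attention head layout: 16 query heads share
# 2 KV heads at head dim 256, so each KV head serves 16 // 2 = 8 query heads.
import torch
import torch.nn.functional as F

tokens, q_heads, kv_heads, head_dim = 8, 16, 2, 256
q = torch.randn(1, q_heads, tokens, head_dim)
k = torch.randn(1, kv_heads, tokens, head_dim)
v = torch.randn(1, kv_heads, tokens, head_dim)

# Expand each KV head across its group of query heads before attention.
k = k.repeat_interleave(q_heads // kv_heads, dim=1)
v = v.repeat_interleave(q_heads // kv_heads, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 16, 8, 256])
```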
Instruct vs Thinking variants
The Instruct variant is configured without built-in reasoning traces (no <think></think> blocks in its output). The Thinking variant enforces reasoning: its chat template opens a <think> block automatically, so raw generations contain only the closing </think> tag, which is why the model cards recommend running it behind a reasoning parser that separates the trace from the final answer.
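When no server-side reasoning parser is used, the trace can be split off client-side. A minimal sketch, assuming the raw Thinking output contains a closing </think> tag as described above:

```python
# Hedged sketch: separate the reasoning trace from the final answer in raw
# Thinking-variant output when no server-side reasoning parser is enabled.
def split_thinking(text: str):
    marker = "</think>"
    if marker in text:
        thinking, answer = text.split(marker, 1)
        return thinking.replace("<think>", "").strip(), answer.strip()
    return "", text.strip()

raw = "Let me check the arithmetic first...\n</think>\nThe answer is 42."
reasoning, answer = split_thinking(raw)
print(answer)  # "The answer is 42."
```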
What’s new with the FP8 releases
The FP8 model cards emphasize “fine-grained FP8” quantization with block size 128. Deployment steps differ slightly from BF16: both sglang and vLLM currently require main/nightly builds, and the model cards include example commands for serving 256K contexts and enabling optional MTP. The Thinking FP8 card recommends enabling a reasoning parser flag (for example, --reasoning-parser deepseek-r1 in sglang or --reasoning-parser deepseek_r1 in vLLM). Licensing remains Apache-2.0.
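Once a server is running, both sglang and vLLM expose an OpenAI-compatible API, so a client query looks like the sketch below. The port and base URL are placeholders, and the reasoning_content field is only populated when the server was launched with one of the reasoning parser flags mentioned above.

```python
# Hedged client-side example against a locally served FP8 Thinking checkpoint.
# Endpoint, port, and sampling settings are placeholders, not Qwen's recommended config.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen3-Next-80B-A3B-Thinking-FP8",
    messages=[{"role": "user", "content": "Summarize the A3B hybrid layout in two sentences."}],
    max_tokens=512,
)

msg = resp.choices[0].message
print(getattr(msg, "reasoning_content", None))  # reasoning trace, if a parser flag is enabled
print(msg.content)                              # final answer
```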
Benchmarks and reported performance
Qwen’s FP8 Instruct card reproduces the BF16 comparison table, positioning Qwen3-Next-80B-A3B-Instruct near Qwen3-235B-A22B-Instruct-2507 across multiple knowledge, reasoning, and coding benchmarks, with an advantage on long-context workloads (up to 256K). The Thinking FP8 card lists results on AIME'25, HMMT'25, MMLU-Pro/Redux, and LiveCodeBench v6, claiming gains over earlier Qwen3 Thinking releases and wins over Gemini-2.5-Flash-Thinking on several of these tasks.
Training, stability and post-training signals
The series was trained on about 15T tokens prior to post-training. Qwen notes stability improvements such as zero-centered and weight-decayed layer norm variants, and uses GSPO in RL post-training for the Thinking model to better handle the hybrid attention plus high-sparsity MoE combination. MTP is used both to speed inference and to improve pretraining signal.
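The norm change is easy to picture. Below is a hedged sketch of one plausible zero-centered RMSNorm formulation, in which the learnable gain is parameterized as (1 + w) with w initialized at zero, so that applying weight decay to w pulls the effective gain toward 1 rather than toward 0. This illustrates the stated idea, not Qwen’s exact implementation.

```python
# Hedged sketch of a zero-centered RMSNorm (assumed formulation, not Qwen's code):
# gain = 1 + w, with w zero-initialized; weight decay on w keeps the gain near 1.
import torch
import torch.nn as nn

class ZeroCenteredRMSNorm(nn.Module):
    def __init__(self, hidden: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(hidden))  # zero-centered gain offset
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * (1.0 + self.weight)             # effective gain = 1 + w
```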
Why FP8 matters for long-context MoE
On modern accelerators, FP8 reduces activation and weight memory footprint and bandwidth compared to BF16, enabling larger batches or longer sequences at comparable latency. Because A3B routes only ~3B parameters per token, the combination of FP8 quantization and MoE sparsity compounds throughput gains in long-context regimes, especially when paired with speculative decoding via MTP. However, quantization interacts with routing and attention variants: acceptance rates for speculative decoding and end-task accuracy can vary across engines and kernel implementations. Qwen therefore recommends using current sglang/vLLM builds and tuning speculative settings when deploying FP8.
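A back-of-the-envelope calculation makes the weight-memory side concrete (KV cache and activations excluded; the scale overhead assumes one FP32 scale per 128 weights, consistent with block size 128):

```python
# Rough weight-memory comparison for an 80B-parameter model; ignores KV cache,
# activations, and engine overheads. Scale storage is an assumption (FP32 per block).
params = 80e9
bf16_gb = params * 2 / 1e9            # 2 bytes per parameter
fp8_gb = params * 1 / 1e9             # 1 byte per parameter
scales_gb = (params / 128) * 4 / 1e9  # one FP32 scale per 128-weight block
print(f"BF16 weights: ~{bf16_gb:.0f} GB")
print(f"FP8 weights:  ~{fp8_gb:.0f} GB (+ ~{scales_gb:.1f} GB of block scales)")
```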
Practical guidance for deployment
These FP8 releases make the 80B/3B-active A3B stack practical to serve at 256K context on mainstream engines while preserving the hybrid-MoE design and MTP path for high throughput. Because model cards retain BF16 benchmarks, teams should validate FP8 accuracy and latency on their stacks, paying particular attention to reasoning parser flags and speculative decoding parameters.
Additional resources
Qwen points users to its GitHub page for tutorials, code, and notebooks, and to its social channels for updates and community discussion.