DeepSeek V3.2-Exp Cuts Long-Context Costs with Trainable Sparse Attention
DeepSeek has published V3.2-Exp, an intermediate release that targets long-context efficiency by introducing DeepSeek Sparse Attention (DSA). The update preserves the V3/V3.1 stack (MoE + MLA) while adding a two-stage, trainable sparsification path, and the team reports API price cuts of 50% or more, consistent with the claimed efficiency gains.
How DSA works
DSA splits attention into two compute tiers. First, a lightweight FP8 “indexer” computes scores for preceding tokens using a few small heads and a ReLU nonlinearity. Because this stage runs in FP8 and uses small head counts, its wall-time and FLOP cost are minor relative to dense attention.
Second, a top-k selection picks a limited number of key-value entries per query (the release uses top-k = 2048). Standard attention is then executed only over that selected subset. This shifts the dominant complexity term from O(L^2) to O(L·k) with k ≪ L, while still allowing queries to attend to distant tokens when necessary.
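A minimal sketch of this two-stage pattern for a single decode step, in plain PyTorch. Tensor names, shapes, and the use of fp32 (rather than DeepSeek's FP8 indexer kernels) are illustrative assumptions, not the released implementation:

```python
import torch
import torch.nn.functional as F

def dsa_decode_step(q, keys, values, idx_q, idx_k, top_k=2048):
    """
    q:      (H, D)    query heads for the current token
    keys:   (L, D)    cached keys for the L preceding tokens
    values: (L, D)    cached values
    idx_q:  (Hi, Di)  small indexer query heads (Hi is deliberately tiny)
    idx_k:  (L, Di)   indexer keys for the preceding tokens
    """
    L = keys.shape[0]
    # Stage 1: lightweight indexer scores every preceding token (ReLU, few small heads).
    index_scores = F.relu(idx_q @ idx_k.T).sum(dim=0)                # (L,)
    # Stage 2: keep only the top-k tokens, then run standard attention on that subset.
    k = min(top_k, L)
    sel = index_scores.topk(k).indices                               # (k,)
    k_sel, v_sel = keys[sel], values[sel]                            # (k, D) each
    attn = torch.softmax(q @ k_sel.T / keys.shape[-1] ** 0.5, dim=-1)  # (H, k)
    return attn @ v_sel                                              # (H, D)
```

The dense O(L^2) score matrix never materializes here; only the cheap indexer touches all L positions, while full attention cost scales with k.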
Training and implementation details
The indexer is trained to imitate the dense model's head-summed attention distribution using a KL-divergence loss. Training proceeds in two phases: a short dense warm-up in which the indexer learns to match the dense attention targets while the main model stays frozen, followed by sparse training in which the indexer's gradients are kept separate from the main language-modeling loss. The warm-up used roughly 2.1B tokens; the sparse stage ran on ~943.7B tokens with top-k = 2048 and a main-model learning rate around 7.3e-6.
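A hedged sketch of that warm-up objective, assuming the target is the dense model's attention weights summed over heads and renormalized (variable names and the exact normalization are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def indexer_kl_loss(index_scores, dense_attn):
    """
    index_scores: (L,)    raw indexer scores for the L preceding tokens
    dense_attn:   (H, L)  dense attention weights from the frozen main model
    """
    # Target distribution: dense attention summed over heads, renormalized.
    target = dense_attn.sum(dim=0)
    target = target / target.sum()
    # Indexer distribution from its scores; KL(target || indexer) drives imitation.
    log_pred = F.log_softmax(index_scores, dim=-1)
    return F.kl_div(log_pred, target, reduction="sum")
```

During the sparse stage, only this loss would update the indexer parameters, while the main model trains on the usual language-modeling objective over the selected tokens.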
DSA is implemented under MLA in MQA decoding mode, so latent KV entries are shared across query heads; this matches the kernel-level requirement of reusing KV entries across queries to sustain throughput. The authors reference TileLang, DeepGEMM (indexer logits), and FlashMLA (sparse kernels) as supporting open-source kernel work. The repository and technical report are available here: https://github.com/deepseek-ai/DeepSeek-V3.2-Exp/blob/main/DeepSeek_V3_2.pdf
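A simplified illustration of why that layout matters, assuming a single shared KV block gathered by the indexer and reused by every query head (this stands in for MLA's latent projections and DeepSeek's actual kernels, which it does not reproduce):

```python
import torch

def mqa_style_decode(q_heads, kv_selected):
    """
    q_heads:     (H, D)  all query heads for the current token
    kv_selected: (k, D)  one shared KV block gathered via the top-k indexer
    """
    # The same k rows are loaded from cache once and reused by all H query heads,
    # which is what makes the sparse gather amortizable at the kernel level.
    # (Using one tensor as both key and value is a simplification of MLA's latent KV.)
    scores = q_heads @ kv_selected.T / kv_selected.shape[-1] ** 0.5   # (H, k)
    weights = torch.softmax(scores, dim=-1)
    return weights @ kv_selected                                      # (H, D)
```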
Efficiency, benchmarks and operational signals
On H800 clusters (reference price $2/GPU-hour), DeepSeek provides per-million-token cost curves for prefill and decode. Reported decode costs fall substantially with DSA, while prefill benefits from a masked MHA simulation at shorter lengths. The “~83%” figure circulating in social posts maps to DeepSeek’s claim of roughly 6× cheaper decode at 128k context, but the release notes recommend treating it as vendor-reported until independent replication under matched batching and cache policies is available.
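The arithmetic linking the two numbers is straightforward: paying roughly one sixth of the dense decode cost corresponds to an ~83% saving.

```python
# Relating "~6x cheaper decode" to the "~83%" figure circulating on social media.
dense_cost = 1.0
sparse_cost = dense_cost / 6             # vendor-reported ~6x cheaper decode at 128k
reduction = 1 - sparse_cost / dense_cost
print(f"{reduction:.1%}")                # -> 83.3%
```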
Benchmark parity is a central claim. The release shows MMLU-Pro at 85.0 (unchanged) and small movements on a few reasoning metrics; agentic/search tasks are flat or slightly positive (e.g., BrowseComp 40.1 vs. 38.5). The authors note that the remaining gaps close when intermediate checkpoints generate comparable numbers of reasoning tokens.
Operationally, day-0 support in SGLang and vLLM suggests the kernel and scheduler changes are production-focused. DeepSeek also cut API prices by 50% or more, consistent with its efficiency messaging and broader press coverage.
Practical takeaway
V3.2-Exp demonstrates that trainable sparsity can materially improve long-context economics while maintaining benchmark parity. For teams using RAG or long-document pipelines where O(L^2) attention is a dominant cost, V3.2-Exp is worth A/B testing as a drop-in option — but validate end-to-end throughput, batching, and quality on your stack before migrating.