
NVIDIA Unveils Dynamic Memory Sparsification for 8× Compression of Transformer KV Caches

NVIDIA researchers developed Dynamic Memory Sparsification (DMS), a novel method that compresses the KV caches of Transformer-based LLMs by up to 8×, improving inference efficiency while maintaining accuracy.

The Challenge of KV Caches in Large Language Models

Transformer-based large language models (LLMs) such as GPT, LLaMA, and Qwen rely on key–value (KV) caches to store past token representations for autoregressive text generation. As these models handle longer sequences or multiple parallel reasoning chains, however, the KV cache grows linearly with sequence length and the number of parallel sequences. This drives up GPU memory consumption and slows inference, since each decoding step must read the growing cache from memory.
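To make the scaling concrete, here is a rough back-of-the-envelope estimate of KV cache size; the layer count, head dimensions, context length, and batch size below are illustrative placeholders, not the configuration of any specific model discussed in the paper:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """Rough KV cache size: two tensors (keys and values) per layer, each of
    shape [batch, num_kv_heads, seq_len, head_dim], stored at dtype_bytes each."""
    return 2 * num_layers * batch * num_kv_heads * seq_len * head_dim * dtype_bytes

# Illustrative numbers only: 32 layers, 8 KV heads, head_dim 128,
# a 32k-token context, 8 parallel sequences, 2 bytes per element (fp16/bf16).
size_gb = kv_cache_bytes(32, 8, 128, 32_768, 8) / 1e9
print(f"KV cache: {size_gb:.1f} GB")  # ~34 GB; grows linearly with seq_len and batch
```

At an 8× compression ratio, a cache of this size would shrink to roughly one eighth of that footprint, which is the kind of saving DMS targets.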

Limitations of Existing KV Cache Optimization Methods

Current optimization techniques either evict tokens based on heuristics such as attention weights, which can reduce model accuracy, or employ heavy post-training retrofits like Dynamic Memory Compression (DMC), which are computationally expensive.

Introducing Dynamic Memory Sparsification (DMS)

Researchers from NVIDIA and the University of Edinburgh have developed Dynamic Memory Sparsification (DMS), a data-efficient, retrofit-friendly method for compressing KV caches without sacrificing accuracy. DMS sparsifies the KV cache in the spirit of pruning methods but with minimal training overhead (around 1,000 steps). It uses a delayed eviction strategy: tokens marked for removal are retained temporarily, which preserves important context and prevents sudden accuracy drops.
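The sketch below illustrates one way such a delayed eviction policy could work at inference time. The class name, data structures, and fixed grace window are assumptions made for illustration, not the paper's implementation:

```python
from collections import deque

class DelayedEvictionCache:
    """Illustrative sketch: tokens flagged for eviction stay readable for
    `window` further decoding steps before they are actually dropped."""

    def __init__(self, window: int):
        self.window = window
        self.kv = {}              # token_idx -> (key, value) tensors
        self.pending = deque()    # (evict_at_step, token_idx), in arrival order
        self.step = 0

    def append(self, token_idx, key, value, evict: bool):
        self.kv[token_idx] = (key, value)
        if evict:
            # Delayed eviction: schedule removal `window` steps in the future.
            self.pending.append((self.step + self.window, token_idx))
        self._flush()
        self.step += 1

    def _flush(self):
        # Drop only the tokens whose grace period has expired.
        while self.pending and self.pending[0][0] <= self.step:
            _, idx = self.pending.popleft()
            self.kv.pop(idx, None)

    def readable_indices(self):
        # Everything still in the cache, including tokens awaiting eviction.
        return sorted(self.kv)
```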

Eviction decisions are made differentiable during training via a Gumbel-sigmoid sampling mechanism. Tokens predicted for eviction remain readable for a sliding window of subsequent steps before final removal, giving the model time to absorb their information.
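As a rough illustration, a Gumbel-sigmoid relaxation with a straight-through estimator can be written as follows; the tensor shapes, temperature, and the straight-through trick are assumptions about how such a mechanism is commonly implemented, not a reproduction of the paper's training code:

```python
import torch

def gumbel_sigmoid(logits: torch.Tensor, tau: float = 1.0, hard: bool = True) -> torch.Tensor:
    """Differentiable relaxation of a binary 'evict / keep' decision.

    logits: per-token eviction logits, e.g. shape [batch, seq_len].
    With hard=True the forward pass is binarized while gradients flow
    through the soft sample (straight-through estimator).
    """
    # Logistic noise equals the difference of two independent Gumbel samples,
    # which turns the two-class Gumbel-softmax into a Gumbel-sigmoid.
    u = torch.rand_like(logits).clamp_(1e-6, 1 - 1e-6)
    noise = torch.log(u) - torch.log1p(-u)
    soft = torch.sigmoid((logits + noise) / tau)
    if not hard:
        return soft
    hard_sample = (soft > 0.5).float()
    # Forward: hard 0/1 mask. Backward: gradient of the soft relaxation.
    return hard_sample + (soft - soft.detach())

# Toy usage: one eviction logit per cached token.
logits = torch.randn(2, 16, requires_grad=True)
evict_mask = gumbel_sigmoid(logits, tau=0.5)  # 1.0 = evict, 0.0 = keep
evict_mask.sum().backward()                   # gradients reach the eviction logits
```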

Efficient Retrofitting and Minimal Training

Unlike Dynamic Memory Compression, DMS requires no extra parameters per attention head. It leverages a small component of the attention mechanism (a single neuron) to predict token eviction, making it ideal for retrofitting existing models without architectural changes.
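A minimal sketch of this kind of retrofit is shown below, repurposing one existing dimension of the key projection as the eviction logit so that no new per-head parameters are added; which neuron DMS actually reuses and how it is masked out of the attention score are assumptions here:

```python
import torch
import torch.nn as nn

class EvictionPredictingHead(nn.Module):
    """Sketch of a retrofit with no new per-head parameters: one existing
    dimension of the key projection doubles as the eviction logit and is
    excluded from the attention score. (Illustrative only.)"""

    def __init__(self, d_model: int, head_dim: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, head_dim)
        self.k_proj = nn.Linear(d_model, head_dim)
        self.v_proj = nn.Linear(d_model, head_dim)

    def forward(self, x: torch.Tensor):
        q = self.q_proj(x)
        k = self.k_proj(x)
        v = self.v_proj(x)
        # Repurpose the first key dimension as an eviction logit and zero it
        # out so it no longer contributes to the q·k attention scores.
        evict_logit = k[..., 0]
        k = torch.cat([torch.zeros_like(k[..., :1]), k[..., 1:]], dim=-1)
        return q, k, v, evict_logit
```

The resulting eviction logit can then be passed through a Gumbel-sigmoid relaxation like the one sketched earlier, so the eviction policy is trained end to end with the rest of the model.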

Performance and Benchmarking

With as few as 1,000 training steps, DMS achieves up to 8× compression of KV caches while preserving or enhancing performance on reasoning tasks. Benchmarks include:

  • AIME 2024 (advanced mathematics)
  • MATH 500 (mathematical problem solving)
  • GPQA Diamond (hard science question answering)
  • LiveCodeBench (code generation)

Across various model sizes (Qwen-R1 1.5B, 7B, and 32B), DMS improved exact-match scores significantly (e.g., 9.1 points on AIME). Compared to top baselines like Quest and TOVA, DMS offers superior KV cache read efficiency and lower peak memory usage.

Versatility Across Tasks

DMS also performs well on short-context benchmarks such as MMLU, GSM8K, and HellaSwag, maintaining performance at up to 4× compression with minimal degradation (~3.5 points). For long-context tasks like Needle-in-a-Haystack and Variable Tracking, DMS even outperformed the original models, suggesting it mitigates information loss in long sequences.

Implications for Real-World Deployment

Dynamic Memory Sparsification offers a scalable and practical approach to improve inference efficiency in Transformer LLMs. By compressing the KV cache intelligently with minimal retraining, it enables longer or parallel reasoning sequences without increasing runtime or memory demands. This balance of compression, accuracy, and ease of integration makes DMS a promising solution for deploying LLMs in resource-constrained environments.

For more details, refer to the official research paper by NVIDIA and the University of Edinburgh.
