
kvcached Unlocks Elastic KV Caching to Slash GPU Memory Waste for LLMs

kvcached provides a virtualized, elastic KV cache for LLM serving on shared GPUs, reducing memory waste and speeding up model activation across colocated models.

Why the KV cache wastes GPU memory

Large language model serving often ends up reserving large, static regions of GPU memory for per-model KV caches. Engines pre-allocate contiguous KV regions even when traffic is bursty or models sit idle. The result is stranded memory, slower activation when a model must be brought online, and poor device utilization for clusters that host many models.

What kvcached introduces

kvcached is a library developed by Berkeley’s Sky Computing Lab in collaboration with Rice University, UCLA, and contributors from industry research groups. It presents an OS-style virtual memory abstraction for the KV cache used by LLM inference engines. Instead of forcing engines to back a full KV allocation with physical GPU pages up front, kvcached lets engines reserve contiguous virtual address ranges and then map only the active portions to physical GPU pages on demand using CUDA virtual memory APIs.
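
To make the mechanism concrete, the minimal sketch below uses the CUDA driver API directly: it reserves a contiguous virtual range, backs only the first chunk with a physical allocation, and later unmaps that chunk so the memory can be reused. This illustrates the underlying CUDA VMM calls, not kvcached's own interface; the sizes and names are illustrative only.

    // Minimal sketch of the CUDA virtual memory mechanism described above.
    // Not kvcached's API: it only shows how a contiguous virtual range can be
    // reserved up front while physical pages are mapped on demand.
    #include <cuda.h>
    #include <cstdio>

    #define CU_CHECK(call)                                                    \
      do {                                                                    \
        CUresult rc_ = (call);                                                \
        if (rc_ != CUDA_SUCCESS) {                                            \
          std::fprintf(stderr, "%s failed with code %d\n", #call, (int)rc_);  \
          return 1;                                                           \
        }                                                                     \
      } while (0)

    int main() {
      CU_CHECK(cuInit(0));
      CUdevice dev;
      CU_CHECK(cuDeviceGet(&dev, 0));
      CUcontext ctx;
      CU_CHECK(cuCtxCreate(&ctx, 0, dev));

      // Physical allocations must be multiples of the device's granularity.
      CUmemAllocationProp prop = {};
      prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
      prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
      prop.location.id = dev;
      size_t gran = 0;
      CU_CHECK(cuMemGetAllocationGranularity(&gran, &prop,
                                             CU_MEM_ALLOC_GRANULARITY_MINIMUM));

      const size_t reserve_bytes = 64 * gran;  // virtual space for the whole cache
      const size_t chunk_bytes   = gran;       // one physical chunk, mapped lazily

      // 1) Reserve a contiguous *virtual* range; no GPU memory is consumed yet.
      CUdeviceptr base = 0;
      CU_CHECK(cuMemAddressReserve(&base, reserve_bytes, 0, 0, 0));

      // 2) Create one physical chunk and map it at the start of the range.
      CUmemGenericAllocationHandle handle;
      CU_CHECK(cuMemCreate(&handle, chunk_bytes, &prop, 0));
      CU_CHECK(cuMemMap(base, chunk_bytes, 0, handle, 0));

      // 3) Grant the device read/write access to the mapped portion only.
      CUmemAccessDesc access = {};
      access.location = prop.location;
      access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
      CU_CHECK(cuMemSetAccess(base, chunk_bytes, &access, 1));

      // Kernels can now address [base, base + chunk_bytes); the rest of the
      // reservation costs nothing until more chunks are mapped the same way.

      // 4) When the region is no longer needed, unmap and release the physical
      //    chunk so other consumers can reuse that memory immediately.
      CU_CHECK(cuMemUnmap(base, chunk_bytes));
      CU_CHECK(cuMemRelease(handle));
      CU_CHECK(cuMemAddressFree(base, reserve_bytes));
      CU_CHECK(cuCtxDestroy(ctx));
      return 0;
    }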

How it works in practice

An engine using kvcached creates a KV cache pool that appears contiguous in the virtual address space. As tokens arrive and the working set grows, the library lazily maps physical GPU pages at a fine granularity. When requests finish or models go idle, pages are unmapped and returned to a shared pool so other colocated models can reuse them immediately. This design preserves simple pointer arithmetic inside kernels and removes the need for user-level paging logic inside each engine.
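
The virtually contiguous layout is what keeps kernel code simple. The short kernel below is a hypothetical illustration only (the [layer][token][dim] layout is an assumption, not kvcached's actual one): addresses are computed with plain offsets into the reserved range, with no per-block indirection table, and only the pages that actually back those addresses need to be mapped.

    #include <cstddef>

    // Hypothetical KV layout: [layer][token][dim] floats, contiguous in the
    // reserved virtual range. Indexing is plain pointer arithmetic.
    __global__ void gather_key(const float* kv_base,
                               size_t tokens_per_layer,  // capacity in tokens
                               int dim,                  // per-token K width
                               int layer, int token,
                               float* out) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < dim) {
        // Only the pages that actually back this offset must be mapped;
        // untouched parts of the reservation consume no physical memory.
        size_t off = ((size_t)layer * tokens_per_layer + token) * dim + i;
        out[i] = kv_base[off];
      }
    }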

kvcached relies on CUDA VMM capabilities for mapping and unmapping GPU pages and targets integration with popular inference engines such as SGLang and vLLM. The project is available under the Apache 2.0 license and includes installation instructions and a one-command quick start in its GitHub repository.

Impact at scale

Real production workloads host many models with long-tail traffic and bursty spikes. Static reservations strand memory and slow down time to first token (TTFT) when inactive models must be activated or swapped in. Recent research shows that multi-LLM serving needs cross-model memory coordination at runtime, not just compute scheduling. Prism, a related project, implements on-demand mapping plus a two-level scheduler and reports more than 2x cost savings and 3.3x higher TTFT SLO attainment compared with prior systems on real traces. kvcached focuses on providing the memory coordination primitive so mainstream engines can adopt it without heavy rewrites.

Performance signals

The kvcached team reports TTFT improvements ranging from 1.2x up to 28x in multi-model serving scenarios. These gains come from the immediate reuse of freed pages and the elimination of large static allocations that would otherwise hold memory for long periods. The largest benefits appear when activation latency and memory headroom determine tail latency in colocated, bursty workloads.

Practical applications for developers

  • Colocation across models: Engines can colocate several small or medium models on the same device. When one model goes idle, its KV pages are freed quickly and another model can expand its working set without restarting, reducing head-of-line blocking during bursts (see the sketch after this list).
  • Faster activation: Virtual reservations let engines prepare address ranges in advance and map pages as tokens arrive, reducing activation latency for cold or infrequently used models.
  • Serverless autoscaling: Fine-grained page mapping makes it feasible to spin replicas up and down more frequently and to keep cold models warm with a minimal memory footprint, enabling tighter autoscaling loops.
  • Offload and compaction: Virtual memory enables future directions like offloading KV pages to host memory or NVMe under favorable access patterns. Throughput and latency effects depend strongly on access locality and PCIe or NVLink topology, so validate in your environment.
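
The plain C++ sketch below is a hypothetical illustration of the colocation scenario from the first bullet, not kvcached code: two models draw physical pages from one shared budget, so pages released by an idle model become immediately available to one that is bursting, without restarting either engine.

    // Hypothetical sketch of cross-model page reuse (not kvcached code).
    #include <cstdio>
    #include <string>

    struct PagePool {
      size_t free_pages;                       // device pages not mapped anywhere
      bool take(size_t n) {                    // grant pages if available
        if (n > free_pages) return false;
        free_pages -= n;
        return true;
      }
      void give_back(size_t n) { free_pages += n; }
    };

    struct ModelKvCache {
      std::string name;
      size_t mapped_pages;                     // pages currently backing this model
      bool grow(PagePool& pool, size_t n) {    // map more pages as tokens arrive
        if (!pool.take(n)) return false;       // would block: pool is exhausted
        mapped_pages += n;
        return true;
      }
      void shrink(PagePool& pool, size_t n) {  // unmap when requests finish / idle
        if (n > mapped_pages) n = mapped_pages;
        mapped_pages -= n;
        pool.give_back(n);
      }
    };

    int main() {
      PagePool pool{100};                      // e.g. 100 device pages to share
      ModelKvCache a{"model-a", 0}, b{"model-b", 0};

      a.grow(pool, 80);                        // model A serves a burst
      std::printf("A holds %zu pages, pool has %zu free\n",
                  a.mapped_pages, pool.free_pages);

      bool ok = b.grow(pool, 50);              // model B bursts while A is busy
      std::printf("B grow by 50: %s\n", ok ? "ok" : "blocked");

      a.shrink(pool, 70);                      // model A goes idle, pages return
      ok = b.grow(pool, 50);                   // B can now expand without restart
      std::printf("B grow by 50 after A idles: %s, pool has %zu free\n",
                  ok ? "ok" : "blocked", pool.free_pages);
      return 0;
    }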

Integration and adoption

kvcached supplies a reusable component that mainstream engines can integrate to gain elastic KV behavior. It preserves kernel pointer assumptions while enabling dynamic memory reclamation and reuse across models. The project is open source, and documentation and quick-start instructions are available in the repository for teams that want to try it in production.

Key takeaways

kvcached virtualizes the KV cache via GPU virtual memory so engines reserve contiguous virtual space and map physical pages on demand. This enables elastic allocation and reclamation under dynamic loads, improves multi-model colocation, reduces time to first token, and lowers cost compared with static reservations. For clusters with many models and bursty traffic, virtualized KV caching makes colocation safer, activation faster, and autoscaling tighter, while remaining compatible with engines like SGLang and vLLM.

References and resources are available in the project repository and the accompanying research papers for teams that want to evaluate kvcached in their own serving pipelines.
