Battle of the Runtimes: Top 6 LLM Inference Engines to Use in 2025
A practical comparison of the top 6 LLM inference runtimes in 2025, highlighting design tradeoffs, KV cache strategies, performance profiles, and where each engine fits in production.
Why inference runtimes matter
Large language models are increasingly bottlenecked by how quickly and cheaply we can serve tokens under real traffic. The critical implementation choices are how a runtime batches requests, how it overlaps the prefill and decode phases, and how it stores and reuses the KV cache. Those decisions directly affect tokens per second, P50/P99 latency, and GPU memory footprint.
What to look for in an inference engine
Three axes define the practical behavior of an engine: batching strategy, overlap of prefill and decode, and KV cache representation and reuse. Different runtimes trade off these axes differently, so pick the engine that matches your workload profile: short latency-sensitive requests, long chats with heavy prefix reuse, extreme throughput on quantized models, or the ability to run very large models with offload.
vLLM
Design
vLLM centers on PagedAttention. Instead of allocating a single contiguous KV buffer per sequence, it breaks KV into fixed-size blocks and uses an indirection layer so each sequence points to a list of blocks.
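A minimal sketch of the block-table idea (illustrative only, not vLLM's actual implementation; the class names and block size are hypothetical):

```python
# Illustrative sketch of block-table indirection, not vLLM's actual code.
# A shared pool hands out fixed-size KV blocks; each sequence keeps a block
# table (a list of block IDs) instead of one contiguous KV buffer.

BLOCK_SIZE = 16  # tokens per KV block; vLLM uses a similar small fixed size


class BlockPool:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("KV pool exhausted; preempt or swap a sequence")
        return self.free.pop()

    def release(self, block_id: int) -> None:
        self.free.append(block_id)


class Sequence:
    def __init__(self, pool: BlockPool):
        self.pool = pool
        self.block_table: list[int] = []  # logical positions -> physical blocks
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new physical block is taken only when the last one is full, so
        # per-sequence waste is bounded by one partially filled block.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.pool.allocate())
        self.num_tokens += 1
```

Prefix sharing falls out of the same indirection: sequences with an identical prompt prefix can point at the same physical blocks and only take private copies once their tokens diverge.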
Why it helps
- Very low KV fragmentation (reported under ~4% waste versus 60–80% for naïve allocators).
- High GPU utilization from continuous batching.
- Native block-level prefix sharing and KV reuse.
- Recent additions include FP8 KV quantization and FlashAttention-style kernels.
Performance and fit
vLLM reports 14–24× higher throughput than Hugging Face Transformers and 2.2–3.5× higher than early TGI for LLaMA models on NVIDIA GPUs in published evaluations. It is a strong default high-performance engine when you want a general LLM serving backend with good throughput, solid TTFT (time to first token), and hardware flexibility.
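For a feel of the developer surface, here is a minimal offline-generation sketch with vLLM's Python API; the model name and sampling values are placeholders, and flags such as enable_prefix_caching assume a recent vLLM release:

```python
# Minimal offline-inference sketch with vLLM's Python API.
# Model name and sampling values are placeholders; adjust for your deployment.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # any HF-compatible model path
    gpu_memory_utilization=0.90,               # fraction of VRAM for weights + KV blocks
    enable_prefix_caching=True,                # reuse KV blocks across shared prompt prefixes
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain PagedAttention in two sentences."], params)
print(outputs[0].outputs[0].text)
```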
TensorRT LLM
Design
TensorRT LLM is a compilation-based engine built on NVIDIA TensorRT. It generates fused kernels specific to model and shape, and exposes an executor API used by systems like Triton.
KV and features
Its KV subsystem is explicit and flexible: paged KV cache, quantized KV options (INT8, FP8), circular buffer modes, KV reuse with CPU offload, and APIs for designing cache-aware routing. NVIDIA reports that CPU-based KV reuse can reduce time to first token by up to 14× on H100, and by even more on GH200, in some scenarios.
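The appeal of quantized KV is easy to see with back-of-the-envelope arithmetic. The sketch below estimates KV footprint for an illustrative 70B-class configuration with grouped-query attention; the layer and head counts are assumptions, not measurements:

```python
# Back-of-the-envelope KV cache sizing; configuration numbers are illustrative.
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch, bytes_per_elem):
    # 2x accounts for keys and values
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * bytes_per_elem

cfg = dict(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=32_768, batch=8)

for name, width in [("FP16", 2), ("FP8/INT8", 1)]:
    gib = kv_cache_bytes(**cfg, bytes_per_elem=width) / 2**30
    print(f"{name}: {gib:.1f} GiB of KV cache")

# Halving the element width halves the KV footprint, which translates directly
# into more concurrent sequences or longer contexts per GPU.
```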
Performance and fit
When tuned per model, TensorRT LLM delivers very low single-request latency and strong throughput on NVIDIA hardware. It suits latency-critical workloads and NVIDIA-only environments where teams can invest in per-model engine builds and tuning.
Hugging Face Text Generation Inference (TGI v3)
Design
TGI is a server-focused stack with a Rust HTTP and gRPC server, continuous batching and streaming, safety hooks, and backends for PyTorch and TensorRT. TGI v3 adds a long-context pipeline: chunked prefill and prefix KV caching to avoid recomputing long histories.
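Assuming a TGI server is already running locally, a client call can go through huggingface_hub's InferenceClient; the endpoint URL, prompt, and generation parameters below are placeholders:

```python
# Sketch of calling a running TGI endpoint; URL and parameters are placeholders.
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")

# Long chat history; on repeat calls TGI v3 can serve the shared prefix from its KV cache
long_history = "..."

reply = client.text_generation(
    prompt=long_history + "\nUser: summarize the thread so far.\nAssistant:",
    max_new_tokens=256,
    stream=False,
)
print(reply)
```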
Behavior and strengths
For conventional prompts, vLLM often slightly outperforms TGI on raw tokens per second at high concurrency, but TGI v3 shines on very long prompts. In published benchmarks, TGI v3 processed around 3× more tokens and was up to 13× faster than vLLM on workloads with long histories and prefix caching enabled. For chat-style workloads with long histories, TGI v3 can deliver much lower TTFT and improved P50.
Where it fits
Production stacks already on Hugging Face, especially chat workloads with long histories where prefix caching yields large real-world gains.
LMDeploy
Design
LMDeploy provides TurboMind CUDA kernels for NVIDIA GPUs and a PyTorch fallback. Key runtime features are persistent continuous batching, a blocked KV cache manager, dynamic split/fuse for attention blocks, tensor parallelism, and support for weight-only and KV quantization (including AWQ and online INT8 / INT4 KV quant).
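A minimal sketch of LMDeploy's pipeline API with the TurboMind backend and online INT8 KV quantization enabled; the model path is a placeholder and the exact config fields assume a recent LMDeploy release:

```python
# Sketch of LMDeploy's pipeline API with TurboMind and online INT8 KV quantization.
# Model path is a placeholder; config fields assume a recent LMDeploy release.
from lmdeploy import pipeline, TurbomindEngineConfig

engine_cfg = TurbomindEngineConfig(
    tp=1,                        # tensor parallel degree
    quant_policy=8,              # online INT8 KV cache quantization (4 selects INT4)
    cache_max_entry_count=0.8,   # fraction of free GPU memory given to the blocked KV cache
)

pipe = pipeline("internlm/internlm2_5-7b-chat", backend_config=engine_cfg)
print(pipe(["What does a blocked KV cache manager do?"])[0].text)
```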
Performance and memory
Vendor tests report up to 1.8× higher throughput than vLLM for some workloads. For 4-bit LLaMA-style models on A100, LMDeploy has shown higher tokens per second under comparable latency constraints. The blocked KV design trades contiguous per-sequence buffers for a managed grid of KV chunks, and the runtime targets maximum throughput on NVIDIA hardware.
Where it fits
NVIDIA-centric deployments aiming for maximum throughput and comfortable with TurboMind tooling and engine-specific optimizations.
SGLang
Design
SGLang is both a DSL for structured LLM programs (agents, RAG workflows, tool pipelines) and a runtime that implements RadixAttention. RadixAttention stores KV in a prefix tree keyed by tokens, which enables high KV hit rates when many calls share prefixes.
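A small frontend program gives the flavor; the endpoint URL, prompts, and generation length are placeholders, and the API shape assumes a recent SGLang release:

```python
# Sketch of an SGLang program whose shared system prompt becomes a reusable
# RadixAttention prefix across calls. Endpoint URL and lengths are placeholders.
import sglang as sgl


@sgl.function
def triage(s, ticket):
    s += sgl.system("You are a support triage assistant for ACME Corp.")  # shared prefix -> KV hit
    s += sgl.user("Classify this ticket and draft a one-line reply:\n" + ticket)
    s += sgl.assistant(sgl.gen("answer", max_tokens=128))


sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

# Every call shares the system-prompt prefix, so its KV is computed once and reused.
states = triage.run_batch([{"ticket": "App crashes on login"}, {"ticket": "Refund not received"}])
for st in states:
    print(st["answer"])
```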
Performance and advantages
SGLang reports up to 6.4× higher throughput and up to 3.7× lower latency than baseline systems like vLLM and LMQL on structured workloads with heavy prefix reuse. Reported KV hit rates range from ~50% to 99%, and cache-aware schedulers can approach optimal hit rates.
Where it fits
Agentic systems, tool chains, and RAG applications with lots of shared prompt prefixes where application-level KV reuse is critical.
DeepSpeed Inference and ZeRO Inference
Design
DeepSpeed offers optimized transformer kernels and parallelism for inference, while ZeRO Inference / ZeRO Offload enable offloading model weights and, in some setups, KV to CPU or NVMe. This lets very large models run on limited GPU memory.
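A heavily simplified sketch of the ZeRO-Inference pattern: wrap a Hugging Face model with a ZeRO stage-3 config that offloads parameters to CPU (or NVMe via an nvme_path), then generate through the wrapped module. The model name and config values are placeholders, and the exact keys assume a recent DeepSpeed release.

```python
# Heavily simplified ZeRO-Inference sketch: ZeRO stage 3 with parameter offload.
# Model name, dtype, and config values are placeholders.
import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # required by the engine even for inference-only use
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                                               # partition parameters
        "offload_param": {"device": "cpu", "pin_memory": True},   # or "nvme" plus "nvme_path"
    },
}

name = "facebook/opt-30b"
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16)
engine = deepspeed.initialize(model=model, config=ds_config)[0]  # returns (engine, _, _, _)
engine.module.eval()

tok = AutoTokenizer.from_pretrained(name)
inputs = tok("ZeRO offload lets a 30B model run on one 32GB GPU because", return_tensors="pt").to(engine.device)
print(tok.decode(engine.module.generate(**inputs, max_new_tokens=64)[0], skip_special_tokens=True))
```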
Performance characteristics
Offload enables models that do not fit on the GPU. For example, full CPU offload of OPT-30B on a single V100 32GB reached ~43 tokens per second, and full NVMe offload ~30 tokens per second. These numbers are lower than what GPU-resident runtimes achieve on A100/H100, but they allow larger batch sizes and fit models that otherwise cannot be served.
Where it fits
Offline or batch inference, low-QPS services, or any scenario where model size matters more than latency and running big models on limited GPU memory is required.
Practical guidance for choosing a runtime
- Need a solid default with good throughput and general hardware support: start with vLLM.
- Committed to NVIDIA and need fine-grained latency control: TensorRT LLM, possibly behind Triton.
- Already on Hugging Face, and chat workloads dominate: TGI v3 for long-context prefix caching.
- Maximize throughput per GPU with quantized models (4-bit): LMDeploy with TurboMind.
- Building agents, pipelines, or heavy RAG: SGLang to exploit RadixAttention and prefix reuse.
- Must run very large models on small GPUs: DeepSpeed Inference / ZeRO with offload, accepting higher TTFT.
All engines converge on one insight: KV cache handling is the bottleneck. The best runtimes treat KV as a first-class data structure to page, quantize, reuse, and offload, not merely a large tensor stuffed into GPU memory.