
LLM Inference Showdown: vLLM vs TensorRT-LLM vs HF TGI v3 vs LMDeploy

A concise technical comparison of vLLM, TensorRT-LLM, Hugging Face TGI v3, and LMDeploy, highlighting when to use each stack for production LLM inference based on throughput, latency, and KV-cache behavior.

Production LLM serving is a systems engineering problem where the inference stack determines tokens per second, tail latency, and ultimately cost on a GPU fleet. Below is a focused technical comparison of four common stacks: vLLM, NVIDIA TensorRT-LLM, Hugging Face Text Generation Inference v3, and LMDeploy.

vLLM — PagedAttention as an open baseline

Core idea

vLLM centers on PagedAttention, which treats the KV cache like paged virtual memory instead of a single contiguous buffer per sequence. Rather than allocating one large KV region per request, vLLM:

  • Divides KV cache into fixed-size blocks
  • Maintains a block table mapping logical tokens to physical blocks
  • Shares blocks between sequences where prefixes overlap

This reduces external fragmentation and lets the scheduler pack many more concurrent sequences into the same VRAM.
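To make the bookkeeping concrete, here is a small hypothetical sketch of block-table management in the PagedAttention style; the class names, block size, and sharing details are invented for illustration and are not vLLM's internal code (real copy-on-write handling of the last partially filled block is omitted for brevity).

```python
# Hypothetical illustration of PagedAttention-style KV block bookkeeping;
# names and sizes are invented for clarity, not taken from vLLM internals.
BLOCK_SIZE = 16  # tokens stored per KV block


class BlockAllocator:
    """Hands out fixed-size physical KV blocks and tracks sharing via refcounts."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.refcount: dict[int, int] = {}

    def alloc(self) -> int:
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> int:
        # Prefix sharing: another sequence reuses the same physical block.
        self.refcount[block] += 1
        return block

    def release(self, block: int) -> None:
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            self.free.append(block)


class Sequence:
    """Maps logical token positions to physical KV blocks via a block table."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # Grow on demand, one fixed-size block at a time: no large upfront buffer.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

    def fork_from(self, parent: "Sequence") -> None:
        # Share the parent's blocks instead of copying its KV cache.
        self.block_table = [self.allocator.share(b) for b in parent.block_table]
        self.num_tokens = parent.num_tokens
```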

Throughput and latency

vLLM commonly improves throughput by 2–4× over older systems such as FasterTransformer and Orca at similar latency, with larger gains for longer sequences. A key operational property is continuous (inflight) batching, which merges incoming requests into batches already running on the GPU. On chat workloads, throughput scales near-linearly with concurrency until KV memory or compute becomes the bottleneck. P50 latency stays low at moderate concurrency, while P99 latency can grow when queues lengthen or KV memory is tight, especially for prefill-heavy queries.

KV and multi-tenant behavior

PagedAttention yields near-zero KV waste and flexible prefix sharing within and across requests. Each vLLM process serves one model, and multi-model or multi-tenant deployments are typically built with an external router or API gateway that fans out to multiple vLLM instances. vLLM exposes an OpenAI-compatible HTTP API and integrates with Ray Serve and common orchestrators, which contributes to its role as the open baseline.
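As a usage sketch, a standard OpenAI client can point at a local vLLM instance; this assumes a server already started with something like `vllm serve <model>` on port 8000, and the model name, API key, and prompt are placeholders.

```python
# Assumes a vLLM server with an OpenAI-compatible endpoint is already running
# locally (e.g. started via `vllm serve <model>` on port 8000); the model name
# and API key below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # whatever model the server loaded
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```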

TensorRT-LLM — the hardware-optimized maximum on NVIDIA GPUs

Core idea

TensorRT-LLM is NVIDIA's inference library for its own GPUs, offering custom attention kernels, inflight batching, paged KV caching, quantization down to FP4/INT4, and speculative decoding. It is tightly coupled to NVIDIA hardware features, including FP8 tensor cores on the Hopper and Blackwell architectures.
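For a sense of the workflow, here is a minimal sketch using the high-level Python LLM API documented in recent TensorRT-LLM releases; class and argument names vary by version, and the model identifier and sampling settings are placeholders rather than a recommended configuration.

```python
# Minimal sketch of TensorRT-LLM's high-level Python (LLM) API as documented in
# recent releases; argument names and quantization options vary by version, and
# the model identifier here is a placeholder.
from tensorrt_llm import LLM, SamplingParams

# Builds or loads an engine for the local GPUs from a Hugging Face checkpoint.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

params = SamplingParams(max_tokens=64, temperature=0.7)
outputs = llm.generate(["Explain inflight batching in one sentence."], params)
print(outputs[0].outputs[0].text)
```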

Measured performance

Public NVIDIA numbers show an H100 with FP8 reaching over 10,000 output tokens/s at peak throughput for 64 concurrent requests, with roughly 100 ms time to first token (TTFT). H100 FP8 can achieve up to 4.6× higher max throughput and 4.4× faster first-token latency than A100 on the same models. In latency-sensitive modes, TensorRT-LLM on H100 can push TTFT below 10 ms in batch-1 configurations at the expense of overall throughput. Exact figures remain model- and shape-dependent, but these references give realistic scale expectations.

Prefill vs decode

TensorRT-LLM optimizes both prefill and decode: prefill benefits from high-throughput FP8 attention kernels and tensor parallelism; decode benefits from CUDA graphs, speculative decoding, quantized weights and KV, and kernel fusion. The combined effect is very high tokens/s across varied input and output lengths when the engine is tuned for the target model and batch profile.

KV and multi-tenant considerations

TensorRT-LLM provides paged KV cache with configurable layouts, long-sequence support, KV reuse and offloading, inflight batching, and priority-aware scheduling. NVIDIA typically pairs the engine with Ray- or Triton-based orchestration patterns for multi-tenant clusters. Multi-model routing is handled at the orchestrator layer rather than inside a single TensorRT-LLM engine instance.

Hugging Face Text Generation Inference (TGI) v3 — long-prompt specialist and multi-backend gateway

Core idea

TGI is a Rust and Python serving stack providing HTTP and gRPC APIs, continuous batching, observability and autoscaling hooks, and pluggable backends (including vLLM-style engines and TensorRT-LLM). Version 3 emphasizes long-prompt processing via chunking and prefix caching.

Long prompt performance

Hugging Face reports dramatic gains for very long prompts: an example where a conversation reply takes 27.5 s in vLLM can be served in about 2 s in TGI v3, a reported 13× speedup on that workload. TGI v3 can process roughly 3× more tokens in the same GPU memory by reducing its memory footprint and exploiting chunking and caching. The mechanism keeps the original conversation context in a prefix cache so subsequent turns only incur incremental token costs; cache lookup overhead is on the order of microseconds.
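A client-side sketch of why this matters, assuming a TGI v3 server is already running on localhost:8080 and using the huggingface_hub InferenceClient; the file name and prompts are placeholders. The caching itself is server-side and transparent: the second turn repeats the same long prefix, so only the new suffix needs to be prefilled.

```python
# Assumes a TGI v3 server is already running on localhost:8080; the document
# file and questions are placeholders. Prefix caching is transparent to the
# client: the second call reuses the cached long prefix on the server.
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")

long_document = open("report.txt").read()  # e.g. a very long analytic report

first = client.text_generation(
    f"{long_document}\n\nQuestion: What are the key findings?\nAnswer:",
    max_new_tokens=256,
)

# Second turn repeats the same long prefix; TGI v3 only prefills the new part.
second = client.text_generation(
    f"{long_document}\n\nQuestion: What are the key findings?\nAnswer: {first}"
    f"\n\nQuestion: Are any risks mentioned?\nAnswer:",
    max_new_tokens=256,
)
print(second)
```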

Architecture and latency behavior

TGI v3 uses chunking to split very long prompts into manageable segments, prefix caching to share long context across turns, and continuous batching to merge requests into already-running batches, and it leverages paged attention and fused kernels in its GPU backends. For short chat-style workloads, throughput and latency are comparable to vLLM, while for long, cacheable contexts both P50 and P99 can improve by orders of magnitude because repeated prefill is avoided.

Multi-backend and multi-model routing

TGI is designed as a router-plus-model-server architecture, able to route requests across many models and target different backends (for example, TensorRT-LLM on H100 for high-priority traffic and CPU or smaller GPUs for low-priority traffic). This makes it a strong choice for a central serving tier in multi-tenant environments.

LMDeploy — TurboMind with blocked KV and aggressive quantization

Core idea

LMDeploy, from the InternLM ecosystem, focuses on compressing and serving LLMs via the TurboMind engine. It emphasizes high request throughput with blocked KV cache, persistent batching, and weight and KV quantization.

Relative throughput and latency

LMDeploy claims up to 1.8× higher request throughput than vLLM thanks to persistent batching, blocked KV, dynamic split-and-fuse, tensor parallelism, and optimized CUDA kernels. Blocked KV cache helps pack many sequences into VRAM, and KV cache quantization (int8 or int4) reduces KV memory and bandwidth. LMDeploy also supports weight-only 4-bit quantization formats like AWQ. The project includes a benchmarking harness that reports token throughput, request throughput, and first-token latency.
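A minimal configuration sketch based on LMDeploy's documented pipeline API; the model name and the specific settings below are illustrative assumptions and vary by release.

```python
# Sketch based on LMDeploy's documented pipeline API; the model name, KV quant
# policy, and cache fraction are illustrative, not a tuned configuration.
from lmdeploy import pipeline, TurbomindEngineConfig

engine_config = TurbomindEngineConfig(
    quant_policy=8,             # int8 KV cache quantization (4 selects int4 KV)
    cache_max_entry_count=0.8,  # fraction of free VRAM reserved for the blocked KV cache
    tp=1,                       # tensor parallel degree
)

pipe = pipeline("internlm/internlm2_5-7b-chat", backend_config=engine_config)
print(pipe(["Give one use case for KV cache quantization."]))
```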

Multi-model deployments

LMDeploy provides a proxy server for multi-model, multi-machine, multi-GPU setups with routing logic to select models based on request metadata, positioning it architecturally closer to TGI than to a single-engine deployment model.

Choosing the right stack

  • If maximum tokens/s and minimal TTFT on NVIDIA GPUs are your priority: TensorRT-LLM is the primary choice, leveraging FP8 and custom kernels to push throughput and to reach sub-100 ms TTFT at high concurrency or sub-10 ms at low concurrency.
  • If your workload is dominated by very long prompts with reuse (RAG, long analytic summarization): TGI v3 is a strong default, with prefix caching and chunking delivering large memory savings and major latency reductions relative to vLLM in published examples.
  • If you prefer an open, simple engine with an OpenAI-style API and strong baseline performance: vLLM remains the standard open baseline, with PagedAttention and continuous batching yielding 2–4× improvements over older stacks.
  • If you target open models like InternLM or Qwen and want aggressive quantization with multi-model serving: LMDeploy is a good fit, offering blocked KV, persistent batching, and int8/int4 KV quantization to increase request throughput.

Many production environments combine these systems by workload: TensorRT-LLM for high-volume proprietary chat, TGI v3 for long-context analytics, and vLLM or LMDeploy for experimental and open-model workloads. The crucial step is to align throughput, latency tails, and KV behavior with your actual token distributions, and to compute cost per million tokens from measured tokens/s on your own hardware, as in the short example below.
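The cost arithmetic itself is simple; the GPU price and throughput below are placeholder assumptions rather than measurements.

```python
# Cost per million output tokens from measured throughput; both inputs here
# are illustrative assumptions, not benchmark results.
gpu_cost_per_hour = 4.00           # assumed $/hour for one GPU
measured_tokens_per_second = 2500  # assumed sustained output tokens/s on that GPU

tokens_per_hour = measured_tokens_per_second * 3600
cost_per_million_tokens = gpu_cost_per_hour / tokens_per_hour * 1_000_000
print(f"${cost_per_million_tokens:.2f} per million output tokens")  # ≈ $0.44 with these numbers
```

Swapping in measured tokens/s for each stack on your own hardware turns the benchmark comparisons above into directly comparable dollar figures.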
