Amin: How Optimistic Scheduling Can Make Your LLM Up to 5x Faster
Amin is an optimistic, adaptive scheduler that uses lower-bound estimates and ordered eviction to cut LLM inference latency dramatically, often matching hindsight-optimal performance and outperforming conservative algorithms by up to 5x.
The hidden bottleneck in LLM inference
Large language model inference splits into a fast prefill stage for the input and a slower autoregressive decode stage that emits tokens one by one. Input length is known, but output length is uncertain: it might be a single token or hundreds. That uncertainty is the real performance killer on GPUs with limited KV cache memory. To avoid cache overflows, schedulers commonly allocate for the worst case, which leaves GPUs underutilized, batches small, and latency high.
Why pessimism hurts
Conservative schedulers like the benchmark Amax assume each request will reach its predicted upper bound. This prevents crashes at the cost of efficiency: GPU KV caches are filled with fewer jobs, concurrency drops, and latency can balloon. Experiments on datasets such as LMSYS-Chat-1M show Amax degrading sharply as prediction uncertainty grows, sometimes producing latencies up to 5x worse than ideal. With billions of inference requests daily, these inefficiencies translate to substantial extra cost and energy use.
Amin: optimistic scheduling that adapts
Researchers from Peking University, Stanford, and HKUST introduce Amin, an algorithm that starts optimistic and learns on the fly. Instead of reserving space for the maximum predicted length, Amin assumes the minimum predicted length for each request. This maximizes initial batch sizes and packs more requests into the KV cache.
Optimism alone would be risky, so Amin adds two adaptive strategies (sketched in code after the list):
- Dynamic refinement: as tokens are produced, Amin updates each request's pseudo lower bound in real time. If a request has already emitted 100 tokens, its lower bound becomes at least 100, improving future packing decisions.
- Ordered eviction: when memory pressure appears, Amin sorts active jobs by their current pseudo lower bounds and evicts the least-progressed jobs first, breaking ties randomly. This protects jobs closer to completion and reduces wasted work from restarts.
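To make the two strategies concrete, here is a minimal Python sketch of one scheduling step, under the simplifying assumption that a job's KV footprint is counted in emitted tokens; the names (Job, try_admit, evict_until_fit, kv_budget) are illustrative, not taken from the paper.

```python
from __future__ import annotations

import random
from dataclasses import dataclass


@dataclass
class Job:
    """One decoding request as tracked by this sketch (fields are illustrative)."""
    job_id: int
    predicted_lower: int   # predicted lower bound on output length, in tokens
    tokens_emitted: int = 0

    @property
    def pseudo_lower_bound(self) -> int:
        # Dynamic refinement: a job that has already emitted k tokens
        # cannot finish in fewer than k, so the bound only ratchets upward.
        return max(self.predicted_lower, self.tokens_emitted)


def try_admit(waiting: list[Job], active: list[Job], kv_budget: int) -> None:
    """Optimistic admission: reserve only each job's pseudo lower bound,
    packing as many requests into the KV cache budget as possible."""
    reserved = sum(j.pseudo_lower_bound for j in active)
    while waiting and reserved + waiting[0].pseudo_lower_bound <= kv_budget:
        job = waiting.pop(0)
        active.append(job)
        reserved += job.pseudo_lower_bound


def evict_until_fit(active: list[Job], kv_budget: int) -> list[Job]:
    """Ordered eviction: when actual usage outgrows the budget, drop the
    least-progressed jobs first (lowest pseudo lower bound), ties at random."""
    # Approximate each job's KV footprint by tokens emitted so far --
    # an assumption of this sketch, not the paper's exact accounting.
    order = sorted(active, key=lambda j: (j.pseudo_lower_bound, random.random()))
    while order and sum(j.tokens_emitted for j in order) > kv_budget:
        order.pop(0)  # evicted job can be re-queued and restarted later
    return order      # surviving jobs; the most-progressed work is protected
```

In a real serving loop these hooks would run every scheduling step: admit optimistically, decode one token per active job, let the pseudo lower bounds refine themselves, and evict only when actual usage exceeds the budget, re-queueing evicted jobs for later admission.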
Amin purposely ignores upper bound predictions, because upper bounds are hard to predict reliably while lower bounds are simpler and more stable. The algorithm runs in O(M log M) time per scheduling step, where M is the KV cache size.
Theoretical guarantees and empirical performance
Amin is backed by both proofs and experiments. The team analyzes Amin's competitive ratio versus a hindsight-optimal scheduler that knows true output lengths in advance. Amin achieves an O(log(1/alpha)) competitive ratio, where alpha is the ratio of the lower bound to the upper bound and measures prediction uncertainty. By contrast, pessimistic schedulers like Amax can suffer unboundedly bad ratios as uncertainty grows.
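In symbols, writing ℓ and u for a request's predicted lower and upper bounds on output length (notation assumed here for readability, not necessarily the paper's), the guarantee above can be stated as:

```latex
% Uncertainty parameter: how tight the predictions are.
\[
  \alpha \;=\; \frac{\ell}{u}, \qquad 0 < \alpha \le 1,
\]
% Amin's guarantee against the hindsight-optimal scheduler OPT:
\[
  \frac{\operatorname{cost}(\text{Amin})}{\operatorname{cost}(\text{OPT})}
  \;=\; O\!\left(\log \tfrac{1}{\alpha}\right),
\]
% whereas a pessimistic scheduler's ratio can grow without bound as \alpha \to 0.
```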
For practical distributions the bounds are tight and small:
- Two-point outputs: ratio at most 1.5.
- Geometric outputs: ratio at most 1.7.
- Linearly weighted geometrics: ratio about 1.56.
On numerical tests with samples from LMSYS-Chat-1M, Amin often approaches the theoretical optimum. With very crude predictions, Amin matched the hindsight-optimal scheduler while Amax trailed by about 2x in one reported scenario. Under noisy predictions Amin delivered up to 5x lower latency than Amax in realistic workloads.
What this means for deployments
Amin lets you get far more out of existing models and hardware without touching model weights or changing GPUs. By relying on simple, reliable lower-bound signals and adapting during decoding, it squeezes near-optimal throughput from imperfect predictors. For teams running large-scale inference, adopting Amin-like schedulers can reduce latency, improve throughput, and cut compute costs substantially.
Practical notes
- Amin requires a lower-bound estimate for each request. These are easier to obtain than tight upper bounds.
- The algorithm's eviction policy protects the most-progressed jobs, minimizing the token generation wasted on restarts.
- Complexity is acceptable for production: O(M log M) per step is efficient for typical KV cache sizes.
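As a purely illustrative example of how cheap such a signal can be, a lower-bound estimator might be as simple as a fixed floor nudged up for long prompts, with dynamic refinement tightening it once decoding begins. The function below is a hypothetical sketch, not a method from the paper:

```python
def lower_bound_estimate(prompt_tokens: int, floor: int = 1) -> int:
    """Hypothetical, deliberately crude lower-bound heuristic.

    Returns at least `floor` output tokens, raised slightly for very long
    prompts. Because Amin's dynamic refinement lifts the bound to the number
    of tokens already emitted, even a weak initial estimate remains usable.
    """
    return max(floor, prompt_tokens // 100)
```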
If you operate LLM inference at scale, reviewing the paper and the pseudocode can be a quick path to measurable speedups in production pipelines.