oLLM Lets 100K-Token LLMs Run on 8 GB Consumer GPUs by Offloading Memory to SSDs

Overview

oLLM is a lightweight Python library built on Hugging Face Transformers and PyTorch that makes very long-context transformer inference possible on a single consumer NVIDIA GPU with only 8–10 GB of VRAM. Instead of quantizing weights, oLLM aggressively offloads large objects such as layer weights and the attention KV cache to fast local SSDs, using FP16/BF16 precision and FlashAttention-2 to keep GPU memory bounded while supporting contexts of up to roughly 100K tokens.
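To see why offloading is needed at all, a quick back-of-the-envelope calculation (ours, not from the project's materials, and using only the headline parameter counts) shows that 16-bit weights alone already exceed an 8 GB card:

```python
# Rough illustration: 16-bit weight footprint vs. an 8 GB consumer GPU.
# Parameter counts are the published headline sizes of the models discussed here.
BYTES_PER_PARAM = 2  # FP16/BF16

for name, params_billions in [("Llama-3-8B", 8), ("GPT-OSS-20B", 20), ("Qwen3-Next-80B", 80)]:
    weight_gb = params_billions * BYTES_PER_PARAM  # 1e9 params * 2 bytes = 2 GB per billion params
    print(f"{name}: ~{weight_gb} GB of weights vs. 8 GB of VRAM -> weights must live on SSD")
```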

Key innovations

Rather than compressing the model, oLLM keeps weights in FP16/BF16 and attacks the memory problem directly: layer weights are streamed from SSD into VRAM only while they are needed, the attention KV cache is offloaded to disk (with optional spillover of some layers to CPU RAM), FlashAttention-2 with online softmax avoids materializing the full attention matrix, and large MLP projections are chunked so peak GPU memory stays bounded. Together these choices let an 8 GB card work through models and contexts that would otherwise require tens of gigabytes of VRAM.

Measured footprints

The maintainer has published end-to-end VRAM and SSD footprints measured on an RTX 3060 Ti (8 GB) for the supported models; see the project repository for the full figures.

How it works

oLLM streams layer weights from SSD into GPU memory on demand, offloads the attention KV cache to disk, and can optionally offload some layers to CPU RAM. It relies on FlashAttention-2 with online softmax so the full attention matrix is never materialized on the GPU, and it chunks large MLP projections so peak memory use stays bounded. These choices move the primary bottleneck from VRAM to storage bandwidth and latency, which is why oLLM emphasizes NVMe-class SSDs and high-throughput file I/O paths like KvikIO or cuFile (GPUDirect Storage).
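A generic sketch of the weight-streaming idea is below. This is not oLLM's actual implementation: the per-layer checkpoint files and the stand-in encoder block are placeholders, and it only illustrates the pattern of holding one layer's weights in VRAM at a time.

```python
# Sketch of layer-by-layer weight streaming (illustrative, not oLLM's code):
# keep every layer's weights on disk, pull one layer at a time into VRAM,
# run it over the hidden states, then free it before loading the next layer.
import torch
import torch.nn as nn

def stream_layers(hidden: torch.Tensor, layer_files: list[str], device: str = "cuda") -> torch.Tensor:
    hidden = hidden.to(device, dtype=torch.bfloat16)
    for path in layer_files:  # hypothetical files, e.g. "layers/layer_00.pt", one per block
        # Stand-in block; a real loader would rebuild the target model's decoder layer.
        layer = nn.TransformerEncoderLayer(d_model=hidden.shape[-1], nhead=8, batch_first=True)
        layer.load_state_dict(torch.load(path, map_location="cpu"))
        layer = layer.to(device, dtype=torch.bfloat16)
        hidden = layer(hidden)       # only this one layer's weights are resident in VRAM
        del layer
        torch.cuda.empty_cache()     # release the cached allocation before loading the next layer
    return hidden
```

The same bounded-memory principle applies to the KV cache and the chunked MLP projections; the trade is that every layer now costs an SSD read, which is why storage bandwidth becomes the limiting factor.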

Supported models and hardware

Out of the box, example configs and scripts cover Llama-3 (1B/3B/8B), GPT-OSS-20B, and Qwen3-Next-80B. The library targets NVIDIA Ampere (RTX 30xx and A-series), Ada (RTX 40xx, L4), and Hopper architectures. Qwen3-Next requires a development build of Transformers (>= 4.57.0.dev). Qwen3-Next-80B is a sparse mixture-of-experts (MoE) model with 80B total parameters and roughly 3B active parameters per forward pass; vendors normally target multi-A100/H100 setups for this model, but oLLM demonstrates an execution path on a single consumer GPU by shifting memory to SSD.
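If you are unsure which architecture a given card belongs to, the CUDA compute capability is a quick proxy (8.0/8.6 for Ampere, 8.9 for Ada, 9.0 for Hopper). The small check below uses standard PyTorch calls; it is a convenience snippet of ours, not an oLLM utility:

```python
# Map the local GPU to the architectures listed above via its compute capability.
import torch

major, minor = torch.cuda.get_device_capability(0)
name = torch.cuda.get_device_name(0)
status = "covered by the listed architectures" if (major, minor) >= (8, 0) else "older than Ampere"
print(f"{name}: compute capability {major}.{minor} -> {status}")
```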

Installation and minimal usage

oLLM is MIT-licensed and available on PyPI via pip install ollm. For high-speed disk I/O, add the kvikio-cu{cuda_version} dependency, and when using Qwen3-Next models install Transformers from GitHub per the project instructions. The repository README shows a short example wiring a DiskCache and a generate call with a streaming text callback. Note that PyPI may list a slightly older release than what the README references.
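A sketch of that flow is shown below. The class and method names (Inference, ini_model, DiskCache, TextStreamer) mirror the README pattern at the time of writing but may differ in the release you install, so treat them as illustrative and check the repository example for the current API.

```python
# Sketch of the README-style flow: weights stream from SSD, the KV cache lives
# in a DiskCache directory, and tokens are printed as they are generated.
# All names below follow the README pattern and may differ in newer releases.
from ollm import Inference, TextStreamer

o = Inference("llama3-1B-chat", device="cuda:0")        # model id from the project's examples
o.ini_model(models_dir="./models/")                     # download/load weights onto local SSD
past_key_values = o.DiskCache(cache_dir="./kv_cache/")  # KV cache offloaded to disk

streamer = TextStreamer(o.tokenizer, skip_prompt=True)  # streaming text callback
prompt = "Summarize the key obligations in the attached contract."
input_ids = o.tokenizer(prompt, return_tensors="pt").input_ids.to(o.device)

o.model.generate(
    input_ids=input_ids,
    past_key_values=past_key_values,  # generate() reads/writes KV through the disk-backed cache
    max_new_tokens=256,
    streamer=streamer,
)
```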

Performance expectations and trade-offs

Throughput is modest at very long context windows: reported numbers include roughly 0.5 tok/s for Qwen3-Next-80B at 50K context on an RTX 3060 Ti, i.e. on the order of half an hour per thousand generated tokens, which makes the approach suited to batch or offline analytics rather than interactive chat. SSD latency and throughput dominate performance, and storage pressure is significant: long contexts produce large KV caches that must be written to and read back from SSD to keep VRAM flat. This mirrors other industry approaches to KV offloading but remains storage-bound and workload-specific.
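To get a feel for the storage pressure, the standard KV-cache sizing formula can be evaluated with placeholder model dimensions (the layer count, KV-head count, and head dimension below are assumptions for illustration, not Qwen3-Next's published configuration):

```python
# KV-cache size estimate for batch size 1:
#   bytes = 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes_per_element
def kv_cache_gb(seq_len: int, layers: int = 48, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

for tokens in (10_000, 50_000, 100_000):
    print(f"{tokens:>7} tokens -> ~{kv_cache_gb(tokens):.1f} GB of KV cache on SSD")
```

Even with these conservative placeholder dimensions, a 100K-token context implies on the order of 20 GB of KV data flowing through the SSD during a single pass, which is why the project stresses NVMe bandwidth and GPUDirect-style I/O.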

When to use oLLM

Treat oLLM as a pragmatic execution path for offline, large-context tasks such as document analysis, compliance review, and long-context summarization where high precision is desired and latency is acceptable. It is not a drop-in replacement for production multi-GPU serving stacks like vLLM or TGI when high throughput or low latency is required.

Further resources

The project repo contains examples, tutorials, and notebooks to get started. Community channels and project pages provide additional guidance on tuning storage, selecting NVMe hardware, and configuring GPUDirect I/O for best results.