LEANN: A Tiny, Storage-Smart ANN Index for Personal AI
LEANN is a storage-efficient ANN index that recomputes embeddings on demand, cutting index size to under 5% of the raw data while keeping strong accuracy and fast query times.
Why storage-efficient vector search matters
Embedding-based search beats keyword methods by capturing semantic similarity with dense vectors and approximate nearest neighbor (ANN) search. The downside is storage: many ANN indexes carry 1.5–7× storage overhead relative to the raw data. That overhead is acceptable for large cloud services, but it rules out deployment on personal devices or over very large local datasets. Bringing index storage below 5% of the original data size is crucial for edge and personal AI, yet existing compression techniques either hurt accuracy or increase latency.
The LEANN approach
LEANN addresses this gap. Instead of persistently storing a full embedding for every dataset item, it combines a compact graph-based index with on-the-fly recomputation of embeddings, computed only for the nodes actually visited during a search. The design rests on two main ideas:
- Two-level graph traversal with dynamic batching: grouping embedding computations across search hops improves GPU utilization and thereby lowers recomputation latency.
- High-preservation graph pruning: this sharply reduces metadata size while keeping the graph connectivity needed for accurate search (a minimal sketch follows this list).
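To make the pruning idea concrete, here is a small Python sketch. It is illustrative only: the function name, the hub heuristic, and the degree caps are assumptions for exposition, not LEANN's published algorithm. The idea shown is to keep a larger edge budget for the best-connected "hub" nodes and trim everyone else, so adjacency lists shrink while the graph stays navigable.

```python
def prune_graph(neighbors, hub_fraction=0.02, hub_degree=64, base_degree=8):
    """Illustrative 'high-preservation' pruning (hypothetical parameters).

    neighbors: dict mapping node id -> list of neighbor ids, sorted by
               ascending distance (closest first), as a graph index would keep.
    Keeps a large edge budget for the most-connected 'hub' nodes and a small
    one everywhere else, shrinking metadata while preserving connectivity.
    """
    # Rank nodes by out-degree and mark the top fraction as hubs.
    degree = {u: len(nbrs) for u, nbrs in neighbors.items()}
    n_hubs = max(1, int(hub_fraction * len(neighbors)))
    hubs = set(sorted(degree, key=degree.get, reverse=True)[:n_hubs])

    pruned = {}
    for u, nbrs in neighbors.items():
        cap = hub_degree if u in hubs else base_degree
        pruned[u] = nbrs[:cap]  # keep only the closest `cap` edges
    return pruned
```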
Built on an HNSW-style graph backbone, LEANN exploits the fact that any single query visits only a small subset of nodes. By recomputing embeddings on demand for just those nodes, rather than storing an embedding for every item, it drastically trims storage requirements.
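The search loop might look roughly like the following sketch. This is a minimal single-level illustration, not LEANN's actual two-level implementation: `graph`, `texts`, and `embed_batch` are assumed inputs, and the per-hop batched call stands in for the paper's dynamic batching. Only the graph and the raw data need to persist on disk; embeddings live in a per-query cache.

```python
import heapq
import numpy as np

def search(query_vec, graph, texts, embed_batch, entry, k=3, beam=32):
    """Best-first graph search that recomputes embeddings on demand.

    graph: dict node id -> list of neighbor ids.
    texts: raw items to embed; embed_batch(list_of_texts) -> np.ndarray
           is an assumed embedding function (one row per input).
    """
    cache = {}  # node id -> embedding, kept only for this query

    def dist_batch(nodes):
        missing = [u for u in nodes if u not in cache]
        if missing:  # one batched model call per search hop
            embs = embed_batch([texts[u] for u in missing])
            cache.update(zip(missing, embs))
        return {u: float(np.linalg.norm(cache[u] - query_vec)) for u in nodes}

    d0 = dist_batch([entry])[entry]
    candidates = [(d0, entry)]   # min-heap: frontier to expand
    results = [(-d0, entry)]     # max-heap: best `beam` nodes so far
    visited = {entry}

    while candidates:
        d, u = heapq.heappop(candidates)
        if len(results) >= beam and d > -results[0][0]:
            break  # frontier can no longer improve the beam
        frontier = [v for v in graph[u] if v not in visited]
        visited.update(frontier)
        for v, dv in dist_batch(frontier).items():
            heapq.heappush(candidates, (dv, v))
            heapq.heappush(results, (-dv, v))
            if len(results) > beam:
                heapq.heappop(results)  # evict current worst

    return [u for _, u in sorted((-d, u) for d, u in results)][:k]
```

Note how the cache is scoped to a single query: repeated visits to a node within one search are free, but nothing is written back to storage.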
Performance highlights
In experiments across real-world question-answering benchmarks, LEANN reduces index storage to under 5% of the raw dataset size, up to 50× smaller than standard indexes, while still achieving strong retrieval accuracy: about 90% top-3 recall in under 2 seconds. The two-level traversal and dynamic batching further cut query latency by keeping GPU throughput high.
LEANN also compares favorably to other recomputation approaches such as EdgeRAG (an IVF-based recomputation system). Thanks to LEANN's polylogarithmic recomputation complexity, its latency improvements over EdgeRAG ranged from roughly 21× to 200×, depending on dataset and hardware. Accuracy on downstream retrieval-augmented generation (RAG) tasks improved on most datasets, with a few exceptions (for example, GPQA and single-hop HotpotQA setups) where dataset distribution or task setup limited the gains.
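One back-of-envelope way to see where that gap comes from (an intuition sketch, not the paper's formal analysis): a graph traversal re-embeds only the nodes it visits, which for HNSW-style indexes grows polylogarithmically with dataset size n, while an IVF-style recomputation must re-embed every item in each probed cluster, and expected cluster size scales linearly in n. Treating the node degree and the IVF parameters as constants:

```latex
% Rough cost model: deg, n_probe, n_list treated as constants.
C_{\text{graph}} \;\approx\; c \cdot \mathrm{deg} \cdot \log n,
\qquad
C_{\text{IVF}} \;\approx\; n_{\text{probe}} \cdot \frac{n}{n_{\text{list}}},
\qquad
\frac{C_{\text{IVF}}}{C_{\text{graph}}} \;=\; \Theta\!\left(\frac{n}{\log n}\right).
```

So the advantage of graph-guided recomputation should widen as the dataset grows, consistent with the reported 21×–200× range.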
Trade-offs and future directions
LEANN makes notable trade-offs. While it minimizes persistent storage, peak memory and storage usage during index construction can still be high. The authors suggest pre-clustering or other construction-time optimizations to reduce that peak. There is also room to further cut latency and improve responsiveness under strict device constraints.
Overall, LEANN demonstrates that careful design — on-demand recomputation, graph pruning, and batching-aware traversal — can make ANN search feasible on resource-limited devices without sacrificing much accuracy. The paper and accompanying GitHub repository provide code, tutorials, and experiments for those who want to try LEANN on personal AI setups.