Apple Introduces CLaRa for Enhanced RAG Compression
Discover CLaRa, a revolutionary framework enhancing retrieval-augmented generation with novel document compression techniques.
The Challenge in RAG Systems
How do you keep RAG systems accurate and efficient when every query stuffs thousands of tokens into the context window and the retriever and generator are still optimized as two separate, disconnected systems? A team of researchers from Apple and the University of Edinburgh has released CLaRa (Continuous Latent Reasoning), available as CLaRa-7B-Base, CLaRa-7B-Instruct, and CLaRa-7B-E2E — a retrieval-augmented generation framework that compresses documents into continuous memory tokens and then performs both retrieval and generation in that shared latent space. The goal is simple: shorten the context, avoid double encoding, and let the generator teach the retriever what actually matters for downstream answers.
From Raw Documents to Continuous Memory Tokens
CLaRa starts with a semantic compressor that attaches a small number of learned memory tokens to each document. During Salient Compressor Pretraining (SCP), the base model is a Mistral 7B style transformer with LoRA adapters that switch between a compressor role and a generator role. The final layer hidden states of the memory tokens become the compressed representation for that document.
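To make the mechanism concrete, here is a minimal PyTorch sketch of the compression step. It is not Apple's released code: the class name, the number of memory tokens, and the way the learned embeddings are attached to the backbone are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MemoryTokenCompressor(nn.Module):
    def __init__(self, backbone, hidden_size: int, num_memory_tokens: int = 16):
        super().__init__()
        self.backbone = backbone  # Mistral-7B-style LM with the "compressor" LoRA adapter active
        # Learned memory-token embeddings, appended after the document tokens.
        self.memory_embeds = nn.Parameter(torch.randn(num_memory_tokens, hidden_size) * 0.02)

    def forward(self, doc_input_ids: torch.LongTensor) -> torch.Tensor:
        # Embed the document tokens with the backbone's own embedding table.
        doc_embeds = self.backbone.get_input_embeddings()(doc_input_ids)
        mem = self.memory_embeds.unsqueeze(0).expand(doc_embeds.size(0), -1, -1)
        inputs_embeds = torch.cat([doc_embeds, mem], dim=1)
        out = self.backbone(inputs_embeds=inputs_embeds, output_hidden_states=True)
        # Final-layer hidden states at the memory-token positions serve as the
        # compressed representation of the document.
        return out.hidden_states[-1][:, -mem.size(1):, :]
```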
SCP is trained on about 2M passages from Wikipedia 2021. A local Qwen-32B model generates three supervision signals for each passage. Simple QA pairs cover atomic facts, while complex QA pairs enforce multi-hop reasoning. Paraphrases reorder and compress the text while preserving semantics. A verification loop checks factual consistency and coverage for up to 10 rounds before accepting a sample.
Training uses two losses. A cross-entropy term trains the generator to answer questions or produce paraphrases conditioned only on the memory tokens. A mean squared error term aligns the average hidden state of document tokens with that of the memory tokens. The MSE loss offers consistent gains of about 0.3 to 0.6 F1 points at compression ratios of 32 and 128.
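A hedged sketch of how the two losses can be combined, assuming the generator's logits, the target answer tokens, and the final-layer hidden states of document and memory tokens are already computed; the loss weighting and masking details are assumptions rather than the paper's exact recipe.

```python
import torch.nn.functional as F

def scp_loss(gen_logits, answer_ids, doc_hidden, mem_hidden, mse_weight: float = 1.0):
    # Cross-entropy: answer the question / produce the paraphrase conditioned only on memory tokens.
    ce = F.cross_entropy(
        gen_logits.view(-1, gen_logits.size(-1)),
        answer_ids.view(-1),
        ignore_index=-100,  # mask out non-answer positions
    )
    # MSE alignment: average document hidden state vs. average memory-token hidden state.
    mse = F.mse_loss(doc_hidden.mean(dim=1), mem_hidden.mean(dim=1))
    return ce + mse_weight * mse
```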
Joint Retrieval and Generation in a Shared Space
After offline compression, each document is represented solely by its memory tokens. CLaRa trains a query reasoner and an answer generator on the same backbone. The query reasoner maps an input question into the same number of memory tokens used for documents. Retrieval becomes pure embedding search based on cosine similarity.
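The retrieval step amounts to a few lines of tensor code. Mean-pooling the memory tokens into a single vector per query and per document is an assumption for illustration; the paper may score similarity differently.

```python
import torch
import torch.nn.functional as F

def retrieve_top_k(query_mem: torch.Tensor, doc_mems: torch.Tensor, k: int = 20):
    """query_mem: [m, d] memory tokens for one query.
    doc_mems:  [num_docs, m, d] offline-compressed document memory tokens."""
    q = F.normalize(query_mem.mean(dim=0), dim=-1)   # pool query memory tokens
    d = F.normalize(doc_mems.mean(dim=1), dim=-1)    # pool each document's memory tokens
    scores = d @ q                                    # cosine similarity per document
    return scores.topk(k)                             # (values, indices) of the top-k documents
```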
The top-scoring compressed document embeddings for a query are concatenated with the query tokens and fed into the generator adapter. Training uses a standard next-token prediction loss on the final answer, with no explicit relevance labels. A differentiable top-k selector lets gradients from the generator flow back into the query reasoner parameters (see the sketch below).
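One common way to make top-k selection differentiable is a straight-through relaxation: the forward pass uses a hard top-k mask while gradients flow through a softmax over the relevance scores. The sketch below illustrates that idea; CLaRa's actual estimator may differ.

```python
import torch

def differentiable_topk(scores: torch.Tensor, k: int, temperature: float = 1.0) -> torch.Tensor:
    """scores: [num_docs] relevance scores produced by the query reasoner."""
    soft = torch.softmax(scores / temperature, dim=-1)    # relaxed relevance weights
    idx = scores.topk(k, dim=-1).indices
    hard = torch.zeros_like(soft).scatter_(-1, idx, 1.0)  # hard top-k selection mask
    # Straight-through trick: the forward pass uses the hard mask, while gradients
    # flow through the soft weights back into the query reasoner.
    return hard + soft - soft.detach()
```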
Compression Quality and QA Accuracy
The compressor is evaluated on four QA datasets: Natural Questions, HotpotQA, MuSiQue, and 2WikiMultihopQA. In the normal setting, CLaRa at 4× compression achieves an average F1 of 39.86, outperforming strong baselines. In the oracle setting, it exceeds 66.76 F1 at the same compression ratio.
End-to-End QA and Retrieval Behavior
CLaRa uses 20 candidate documents per query with compression ratios of 4, 16, and 32. In the normal setting, it matches the performance of models that read the full uncompressed text while working from far shorter document representations.
What Apple Has Released
Apple's research team released three models on Hugging Face: CLaRa-7B-Base, CLaRa-7B-Instruct, and CLaRa-7B-E2E. Notably, CLaRa-7B-Instruct is a unified RAG model designed to process instruction-style queries from compressed representations.
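If the models follow the standard Hugging Face Transformers loading path, usage would look roughly like the snippet below. The repository ID is an assumption for illustration only; consult the actual model cards for the exact names and any custom loading code the release may require.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "apple/CLaRa-7B-Instruct"  # assumed repository ID; verify on Hugging Face
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
```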
Key Takeaways
- CLaRa replaces raw documents with a small set of continuous memory tokens learned via QA-guided and paraphrase-guided semantic compression.
- Retrieval and generation are jointly trained in a single shared latent space.
- A differentiable top-k estimator aligns document relevance with answer quality.
- CLaRa's SCP compressor outperforms strong text-based baselines.
- Apple has released three ready-to-use models on Hugging Face, underscoring its continued investment in improving RAG systems.
Editorial Notes
CLaRa represents a substantial advancement in retrieval-augmented generation, showing that embedding-based compression, paired with end-to-end training, can outperform traditional text-based methods while maintaining efficiency.