IBM Unveils Two Compact ModernBERT-Based Granite Embedding Models for Long-Context Retrieval
IBM releases two Granite R2 embedding models
IBM AI Research has published two new embedding models, granite-embedding-english-r2 and granite-embedding-small-english-r2, aimed at high-performance retrieval and retrieval-augmented generation (RAG) systems. Both models are compact, efficient, and released under the Apache 2.0 license, making them suitable for commercial deployment.
Models and specifications
The two models target different compute and latency budgets:
- granite-embedding-english-r2: 149 million parameters, 768-dimensional embeddings, built on a 22-layer ModernBERT encoder.
- granite-embedding-small-english-r2: 47 million parameters, 384-dimensional embeddings, built on a 12-layer ModernBERT encoder.
Despite the size difference, both models support a maximum context length of 8192 tokens, a major step up from the first-generation Granite embeddings and a key advantage for enterprise workloads that involve long documents and complex retrieval queries.
Reference: https://arxiv.org/abs/2508.21085
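For a sense of how the models are used in practice, the snippet below sketches a minimal retrieval flow with the sentence-transformers library. The Hugging Face model ID, query, and documents are illustrative assumptions rather than details from IBM's release.

```python
# Minimal retrieval sketch, assuming the models are published on Hugging Face
# under the ibm-granite organization and load with sentence-transformers.
from sentence_transformers import SentenceTransformer, util

# Assumed model ID; the larger granite-embedding-english-r2 swaps in the same way
# (its embedding dimension is 768 instead of 384).
model = SentenceTransformer("ibm-granite/granite-embedding-small-english-r2")

query = "How do I rotate API keys for the billing service?"
documents = [
    "Internal runbook: rotating credentials for the billing service...",
    "Quarterly financial report covering revenue and operating costs...",
]

# Encode the query and documents into dense vectors.
query_emb = model.encode([query], convert_to_tensor=True)
doc_embs = model.encode(documents, convert_to_tensor=True)

# Rank documents by cosine similarity to the query.
scores = util.cos_sim(query_emb, doc_embs)[0]
for doc, score in sorted(zip(documents, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {doc[:60]}")
```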
Architecture and training pipeline
Both Granite R2 models use the ModernBERT backbone with several engineering optimizations:
- Alternating global and local attention to balance efficiency with long-range dependency modeling (a toy schedule is sketched after this list).
- Rotary positional embeddings (RoPE) tuned for positional interpolation, enabling longer context windows.
- FlashAttention 2 to improve memory efficiency and inference throughput.
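To make the alternating-attention idea concrete, the toy sketch below assigns each encoder layer either global or local (sliding-window) attention. The one-global-in-three interleaving and the 128-token window follow the published ModernBERT recipe and are assumptions here; the article does not state Granite R2's exact schedule.

```python
# Toy sketch of a ModernBERT-style attention schedule (assumed values):
# one full/global attention layer every 3 layers, sliding-window attention otherwise.
GLOBAL_EVERY = 3    # assumed interleaving ratio
LOCAL_WINDOW = 128  # assumed sliding-window size in tokens

def attention_schedule(num_layers: int) -> list[dict]:
    """Return a per-layer description of the attention pattern."""
    schedule = []
    for layer in range(num_layers):
        if layer % GLOBAL_EVERY == 0:
            # Global layers attend over the full (up to 8192-token) sequence.
            schedule.append({"layer": layer, "type": "global", "window": None})
        else:
            # Local layers attend only within a fixed window,
            # keeping cost roughly linear in sequence length.
            schedule.append({"layer": layer, "type": "local", "window": LOCAL_WINDOW})
    return schedule

# 22 layers for granite-embedding-english-r2, 12 for the small variant.
for entry in attention_schedule(22)[:6]:
    print(entry)
```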
Training followed a multi-stage pipeline:
- Masked language pretraining on a two-trillion-token corpus drawn from web data, Wikipedia, PubMed, BookCorpus, and internal IBM technical documents.
- Context extension from 1k to 8k tokens.
- Contrastive learning with distillation from Mistral-7B (a minimal loss sketch follows this list).
- Domain-specific tuning for conversational, tabular, and code retrieval tasks.
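Of these stages, the contrastive step is the one most retrieval practitioners will recognize. Below is a minimal sketch of an InfoNCE-style in-batch contrastive loss combined with a distillation term that pushes the student toward a teacher's similarity distribution; the temperatures, weighting, and KL formulation are illustrative assumptions, as the article does not detail IBM's exact objective.

```python
import torch
import torch.nn.functional as F

def contrastive_distillation_loss(query_emb, doc_emb, teacher_scores,
                                  tau_student=0.05, tau_teacher=0.05, alpha=0.5):
    """Sketch of in-batch InfoNCE with distillation from a teacher's scores.

    query_emb:      (B, D) student query embeddings
    doc_emb:        (B, D) student embeddings of the positive documents
    teacher_scores: (B, B) query-document relevance scores from a teacher
                    (e.g., a Mistral-7B-based embedder); assumed precomputed.
    """
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    sims = q @ d.T  # (B, B) cosine similarities

    # InfoNCE: each query's positive is the document at the same batch index.
    labels = torch.arange(q.size(0), device=q.device)
    nce = F.cross_entropy(sims / tau_student, labels)

    # Distillation: match the student's similarity distribution to the teacher's.
    student_logp = F.log_softmax(sims / tau_student, dim=-1)
    teacher_p = F.softmax(teacher_scores / tau_teacher, dim=-1)
    kd = F.kl_div(student_logp, teacher_p, reduction="batchmean")

    return alpha * nce + (1 - alpha) * kd
```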
Benchmark performance and domain strengths
The Granite R2 models show competitive results across standard retrieval benchmarks. On MTEB-v2 and BEIR, the larger granite-embedding-english-r2 outperforms similarly sized models such as BGE Base, E5, and Arctic Embed. The smaller granite-embedding-small-english-r2 achieves accuracy close to that of models two to three times its size, making it attractive for latency-sensitive deployments.
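Scores like these can be sanity-checked with the open-source mteb evaluation harness. The sketch below assumes the model loads through sentence-transformers and uses two small BEIR-style retrieval tasks as a smoke test; the exact task mix behind IBM's reported numbers is not specified in the article.

```python
# Sketch of evaluating a Granite R2 model on a couple of retrieval tasks with mteb.
# Model ID and task selection are illustrative assumptions.
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ibm-granite/granite-embedding-english-r2")

# Two small BEIR-style retrieval tasks; MTEB-v2 covers many more.
tasks = mteb.get_tasks(tasks=["SciFact", "NFCorpus"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/granite-r2")
```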
Specialized domains where these models excel include:
- Long-document retrieval (e.g., MLDR, LongEmbed) where 8k context support matters.
- Table retrieval tasks (e.g., OTT-QA, FinQA, OpenWikiTables) requiring structured reasoning (a linearization sketch follows this list).
- Code retrieval (CoIR), handling both text-to-code and code-to-text queries.
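For table retrieval in particular, a common approach (not necessarily the one IBM used) is to linearize each table into a flat string before embedding it alongside ordinary passages. The helper below is a hypothetical illustration of that idea.

```python
# Hypothetical helper: flatten a table into text so it can be embedded
# alongside ordinary passages for table retrieval (e.g., OTT-QA-style tasks).
def linearize_table(title: str, headers: list[str], rows: list[list[str]]) -> str:
    lines = [f"Table: {title}"]
    for row in rows:
        # Pair each cell with its column header, "header: value" style.
        cells = ", ".join(f"{h}: {v}" for h, v in zip(headers, row))
        lines.append(cells)
    return "\n".join(lines)

text = linearize_table(
    "Quarterly revenue",
    ["Quarter", "Revenue (USD M)"],
    [["Q1 2024", "120"], ["Q2 2024", "135"]],
)
# The resulting string can then be passed to model.encode([text]).
```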
Reference: https://arxiv.org/abs/2508.21085
Throughput and deployment considerations
Efficiency is a central strength of the Granite R2 models. On an Nvidia H100 GPU, granite-embedding-small-english-r2 encodes nearly 200 documents per second, which is significantly faster than BGE Small and E5 Small. The larger granite-embedding-english-r2 reaches around 144 documents per second, outperforming many ModernBERT-based alternatives.
These models are also practical in CPU-only environments, allowing enterprises to run retrieval workloads without heavy GPU dependence. The balance of speed, compact size, and retrieval accuracy makes Granite R2 well suited for real-world production systems.
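Throughput claims of this kind are easy to sanity-check locally. The snippet below is a rough benchmarking sketch; the synthetic corpus, batch size, and device are assumptions, and measured numbers will vary with hardware and sequence length.

```python
import time
from sentence_transformers import SentenceTransformer

# Illustrative throughput check; set device="cpu" for GPU-free deployments.
model = SentenceTransformer(
    "ibm-granite/granite-embedding-small-english-r2", device="cuda"
)

# Synthetic corpus of moderately long documents (assumed workload).
docs = ["A synthetic paragraph about enterprise retrieval workloads. " * 50] * 1024

start = time.perf_counter()
model.encode(docs, batch_size=64, show_progress_bar=False)
elapsed = time.perf_counter() - start
print(f"{len(docs) / elapsed:.1f} docs/sec")
```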
Practical implications for retrieval and RAG
Granite R2 demonstrates that strong retrieval performance and long-context capability do not require extremely large parameter counts. With Apache 2.0 licensing and production-friendly throughput, these embeddings are a viable alternative for companies building RAG workflows, search pipelines, and knowledge management systems. For organizations prioritizing long-document support, latency, and commercial readiness, Granite R2 is worth evaluating.