mmBERT Unveiled: 3T Tokens, 1,833 Languages, and a 2–4× Speed Boost for Multilingual Encoding
Why a new multilingual encoder mattered
XLM-RoBERTa dominated multilingual encoder research for more than five years. Meanwhile, the field shifted toward decoder-based generative models, even though encoder-only networks remain more efficient and often perform better on embedding, retrieval, and classification tasks. Development of modern multilingual encoders consequently lagged behind, leaving room for a contemporary alternative.
Core architecture and configurations
mmBERT ships in two main sizes. The base variant has 22 transformer layers, a hidden dimension of 1152, and about 307 million parameters (110 million non-embedding). The small variant totals roughly 140 million parameters (42 million non-embedding).
Key architectural choices include the Gemma 2 tokenizer with a 256k vocabulary, rotary positional embeddings (RoPE), and FlashAttention 2 to speed up attention. Sequence length support is extended from 1,024 to 8,192 tokens during training by combining unpadded inputs with sliding-window attention. These changes let mmBERT handle contexts 16× longer than XLM-R's 512-token limit while remaining faster at inference.
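For a quick illustration of how the encoder is used in practice, the snippet below is a minimal sketch built on standard Hugging Face tooling, assuming the checkpoint is published under an ID such as `jhu-clsp/mmBERT-base` (verify the exact name against the release). It mean-pools the last hidden state as a simple multilingual sentence embedding.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "jhu-clsp/mmBERT-base"  # assumed Hugging Face ID; check the release
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

texts = ["Hello, world!", "Bonjour le monde !", "こんにちは、世界"]
batch = tokenizer(texts, padding=True, truncation=True, max_length=8192, return_tensors="pt")

with torch.inference_mode():
    hidden = model(**batch).last_hidden_state            # (batch, seq_len, 1152)

# Mean-pool over non-padding tokens to get one vector per input text.
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)                                   # torch.Size([3, 1152])
```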
Training data and staged pretraining
The model was trained on about 3 trillion tokens spanning 1,833 languages. Sources include FineWeb2, Dolma, MegaWika v2, ProLong, StarCoder, and others. English occupies only a fraction of the total corpus, roughly 10 to 34 percent depending on training phase.
Pretraining proceeded in three phases:
- Pre-training: 2.3 trillion tokens across 60 languages plus code.
- Mid-training: 600 billion tokens across 110 languages focusing on higher quality sources.
- Decay phase: 100 billion tokens covering 1,833 languages to emphasize low-resource adaptation.
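For concreteness, the staged schedule can be written down as a small configuration table. This is a hedged sketch that only restates the published token budgets and language counts; the structure and field names are invented for illustration, not the actual training code.

```python
# Illustrative summary of the three-stage schedule; values are the published
# token budgets and language counts, the data structure itself is hypothetical.
PHASES = [
    {"name": "pre-training", "tokens": 2_300_000_000_000, "languages": 60},
    {"name": "mid-training", "tokens":   600_000_000_000, "languages": 110},
    {"name": "decay",        "tokens":   100_000_000_000, "languages": 1_833},
]

total = sum(phase["tokens"] for phase in PHASES)
assert total == 3_000_000_000_000  # the budgets sum to the ~3T-token corpus
```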
Novel training strategies
Three main training innovations drive mmBERT's performance (minimal sketches of the sampling/masking schedule and the merge step follow the list):
- Annealed Language Learning (ALL): Languages are introduced gradually from 60 to 110 to 1,833. Sampling distributions are annealed from high-resource bias toward uniform sampling, allowing low-resource languages to gain influence later without overfitting.
- Inverse Masking Schedule: Masking ratio begins at 30 percent and decays to 5 percent, encouraging coarse-grained learning early and finer-grained refinements later.
- Model Merging Across Decay Variants: Multiple decay-phase models with complementary focuses (English-heavy, 110-language, and 1,833-language variants) are merged via TIES merging to combine strengths without full retraining.
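To make the first two ideas concrete, the sketch below shows one common way to implement exponentiated-count language sampling with an annealed exponent, alongside a phase-wise masking schedule. The exponent values (0.7 → 0.5 → 0.3) and the mid-phase masking rate of 15% are assumptions for illustration; only the 30% → 5% endpoints and the growing language pools come from the description above.

```python
import numpy as np

def language_sampling_probs(token_counts, tau):
    """Exponentiated-count sampling: tau=1.0 samples proportionally to corpus
    size, while tau -> 0 approaches uniform sampling over languages."""
    weights = np.asarray(token_counts, dtype=np.float64) ** tau
    return weights / weights.sum()

# Hypothetical per-language corpus sizes (tokens): high-, mid-, low-resource.
counts = [1_000_000_000, 50_000_000, 1_000_000]

# Annealed schedule: later phases flatten the sampling distribution and mask less.
SCHEDULE = [
    {"phase": "pre-training", "languages": 60,    "tau": 0.7, "mask_rate": 0.30},
    {"phase": "mid-training", "languages": 110,   "tau": 0.5, "mask_rate": 0.15},  # mid value assumed
    {"phase": "decay",        "languages": 1_833, "tau": 0.3, "mask_rate": 0.05},
]

for step in SCHEDULE:
    probs = language_sampling_probs(counts, step["tau"])
    print(step["phase"], probs.round(3), f"mask={step['mask_rate']:.0%}")
```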
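The merging step can likewise be sketched. TIES merging operates on parameter deltas relative to a shared base checkpoint: trim small-magnitude entries, elect a sign per parameter by majority mass, then average only the entries that agree with that sign. The function below is a minimal, hedged re-implementation of that recipe, not the authors' code; the `density` default is arbitrary.

```python
import torch

def ties_merge(deltas, density=0.2):
    """Minimal TIES-style merge of per-model parameter deltas (model weights
    minus the shared base checkpoint). Returns a merged delta to add to the base."""
    merged = {}
    for name in deltas[0]:
        flat = torch.stack([d[name].float().flatten() for d in deltas])  # (models, numel)

        # 1) Trim: per model, keep only the top-`density` fraction of entries by magnitude.
        k = max(1, int(density * flat.shape[1]))
        threshold = flat.abs().kthvalue(flat.shape[1] - k + 1, dim=1, keepdim=True).values
        trimmed = torch.where(flat.abs() >= threshold, flat, torch.zeros_like(flat))

        # 2) Elect sign: the dominant sign per entry across the trimmed deltas.
        elected = torch.sign(trimmed.sum(dim=0))

        # 3) Merge: average only the entries whose sign matches the elected one.
        agree = (torch.sign(trimmed) == elected) & (trimmed != 0)
        merged_flat = (trimmed * agree).sum(dim=0) / agree.sum(dim=0).clamp(min=1)
        merged[name] = merged_flat.view_as(deltas[0][name])
    return merged
```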
Benchmark performance
mmBERT shows strong results across tasks:
- English NLU (GLUE): mmBERT base reaches 86.3, outperforming XLM-R (83.3) and approaching ModernBERT (87.4), despite most pretraining being non-English.
- Multilingual NLU (XTREME): mmBERT base scores 72.8 versus XLM-R 70.4, improving classification and QA.
- Embeddings (MTEB v2): mmBERT base ties ModernBERT in English and leads in multilingual metrics.
- Code retrieval (CoIR): mmBERT outperforms XLM-R by roughly nine points, although some proprietary models may still lead on specialized data.
Low-resource language handling
The annealed language schedule ensures low-resource languages receive increasing emphasis in later training phases. On benchmarks for severely low-resource languages, such as Faroese FoQA and Tigrinya TiQuAD, mmBERT outperforms large decoder models like o3 and Gemini 2.5 Pro. This demonstrates that carefully trained encoder models can generalize effectively even for languages with minimal data.
Efficiency and practical gains
mmBERT achieves 2–4× speedups compared to XLM-R and MiniLM while supporting 8192-token inputs. It remains faster at 8192 tokens than older encoders were at 512 tokens. Efficiency gains come from the ModernBERT training recipe, optimized attention implementations, and streamlined embeddings, enabling longer context handling without sacrificing inference throughput.
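Throughput is easy to sanity-check locally. The snippet below is a rough, hedged micro-benchmark (GPU, a batch of eight 8,192-token inputs, the same assumed model ID as above); absolute numbers will vary with hardware, batch size, and whether FlashAttention 2 is installed.

```python
import time
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "jhu-clsp/mmBERT-base"  # assumed ID; verify against the release
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval().to("cuda")

# Eight long documents truncated to the full 8,192-token context.
batch = tokenizer(["lorem ipsum dolor " * 3000] * 8, truncation=True,
                  max_length=8192, return_tensors="pt").to("cuda")

with torch.inference_mode():
    model(**batch)                      # warm-up pass
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(10):
        model(**batch)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

tokens = batch["input_ids"].numel() * 10
print(f"{tokens / elapsed:,.0f} tokens/sec at 8,192-token context")
```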
Practical takeaways
mmBERT provides an open, efficient, and scalable replacement for older multilingual encoders. With a training recipe spanning 3 trillion tokens, annealed language schedules, inverse masking, and model merging, it delivers broad generalization across high-resource and low-resource languages while substantially improving inference speed and context length.