mmBERT Unveiled: 3T Tokens, 1,833 Languages, and a 2–4× Speed Boost for Multilingual Encoding

Why a new multilingual encoder mattered

XLM-RoBERTa dominated multilingual encoder research for more than five years. Meanwhile the field moved toward decoder-based generative models, even though encoder-only networks remain more efficient and often perform better on embedding, retrieval, and classification tasks. Development of modern multilingual encoders lagged behind, creating room for a contemporary alternative.

Core architecture and configurations

mmBERT ships in two main sizes. The base variant has 22 transformer layers, a hidden dimension of 768 with a 1152-dimensional feed-forward intermediate, and about 307 million parameters (110 million non-embedding); the large 256k-token vocabulary accounts for most of the embedding weight. The small variant totals roughly 140 million parameters (42 million non-embedding).
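As a quick sanity check on these figures, the split between embedding and non-embedding parameters follows directly from the vocabulary size and hidden dimension. The arithmetic below is a back-of-the-envelope sketch, not an exact accounting of the released checkpoints.

```python
# Back-of-the-envelope parameter breakdown for mmBERT-base.
# Assumes a 256k vocabulary and a 768-dimensional hidden size;
# the non-embedding count (~110M) is taken from the reported figures.
vocab_size = 256_000
hidden_dim = 768

embedding_params = vocab_size * hidden_dim      # ~196.6M embedding weights
non_embedding_params = 110_000_000              # transformer layers, norms, head

total = embedding_params + non_embedding_params
print(f"embedding:     {embedding_params / 1e6:.1f}M")
print(f"non-embedding: {non_embedding_params / 1e6:.1f}M")
print(f"total:         {total / 1e6:.1f}M")     # ~307M, matching the reported size
```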

Key architectural choices include the Gemma 2 tokenizer with a 256k vocabulary, rotary positional embeddings (RoPE), and FlashAttention 2 for faster attention. Sequence length support is extended from 1,024 tokens during early training to 8,192 tokens, with unpadded (sequence-packed) inputs and alternating global and local sliding-window attention keeping long sequences tractable. These changes let mmBERT process contexts 16 times longer than XLM-R's 512-token limit while remaining faster at inference.
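The models are released as standard Hugging Face checkpoints, so they fit the usual transformers workflow. The sketch below assumes the jhu-clsp/mmBERT-base repository id and a recent transformers release; mean-pooling the last hidden state is one simple way to get sentence embeddings, not an official recipe.

```python
# Minimal sketch: multilingual sentence embeddings with mmBERT-base.
# Assumes the "jhu-clsp/mmBERT-base" checkpoint id; swap in the small
# variant for a lighter model.
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "jhu-clsp/mmBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model.eval()

texts = [
    "A modern multilingual encoder.",          # English
    "Ein moderner mehrsprachiger Encoder.",    # German
    "Um codificador multilíngue moderno.",     # Portuguese
]

# Long-context support: sequences up to 8192 tokens are accepted.
batch = tokenizer(texts, padding=True, truncation=True,
                  max_length=8192, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state   # (batch, seq_len, hidden)

# Mean-pool over non-padding tokens to get one vector per sentence.
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)                         # e.g. torch.Size([3, 768])
```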

Training data and staged pretraining

The model was trained on about 3 trillion tokens spanning 1,833 languages. Sources include FineWeb2, Dolma, MegaWika v2, ProLong, StarCoder, and others. English occupies only a fraction of the total corpus, roughly 10 to 34 percent depending on training phase.

Pretraining proceeded in three phases:

- Pre-training: roughly 2.3 trillion tokens over 60 high- and mid-resource languages plus code, at a 1,024-token context length.
- Mid-training (context extension): roughly 600 billion tokens of higher-quality data over 110 languages, extending the context window to 8,192 tokens.
- Decay: roughly 100 billion tokens covering the full set of 1,833 languages, with a reduced masking rate.
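The staged recipe is easy to express as plain configuration. The sketch below encodes the phases listed above; the token counts, language counts, and mask rates are approximate, and the field names are invented for illustration.

```python
# Sketch of the three-phase schedule as plain configuration.
# Figures are approximate and field names are illustrative.
PHASES = [
    {"name": "pretraining",  "tokens": 2.3e12, "languages": 60,
     "context": 1024, "mask_rate": 0.30},
    {"name": "mid-training", "tokens": 6.0e11, "languages": 110,
     "context": 8192, "mask_rate": 0.15},
    {"name": "decay",        "tokens": 1.0e11, "languages": 1833,
     "context": 8192, "mask_rate": 0.05},
]

for phase in PHASES:
    print(f"{phase['name']:>12}: {phase['tokens']:.1e} tokens, "
          f"{phase['languages']} languages, mask rate {phase['mask_rate']:.0%}")
```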

Novel training strategies

Three main training innovations drive mmBERT's performance:

- Annealed language learning: languages are added in stages while the sampling exponent is lowered, so low-resource languages receive proportionally more weight as training progresses (see the sketch after this list).
- Inverse masking schedule: the masked-language-modeling rate decays over training, from roughly 30 percent down to 5 percent, shifting from a coarse to a fine-grained learning signal.
- Model merging: decay-phase variants trained on different data mixtures (English-heavy, 110-language, and all-1,833-language) are merged into the final checkpoint.
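The annealed schedule is essentially temperature-based sampling with an exponent that shrinks across phases. The corpus sizes below are invented for the example, and the exponent values are illustrative of the general recipe rather than the exact ones used in training.

```python
# Illustrative sketch of an annealed language-sampling schedule.
# Corpus sizes are made up; a smaller exponent flattens the
# distribution, boosting low-resource languages in later phases.
corpus_tokens = {"en": 1.2e12, "de": 2.0e11, "sw": 5.0e9, "fo": 2.0e7}

def sampling_probs(sizes, alpha):
    """p_lang proportional to size_lang ** alpha, renormalized to sum to 1."""
    weights = {lang: size ** alpha for lang, size in sizes.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

for phase, alpha in [("pretraining", 0.7), ("mid-training", 0.5), ("decay", 0.3)]:
    probs = sampling_probs(corpus_tokens, alpha)
    print(phase, {lang: round(p, 4) for lang, p in probs.items()})
```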

Benchmark performance

mmBERT shows strong results across tasks:

- English NLU (GLUE): competitive with strong English-only encoders despite English making up only a minority of the training mix.
- Multilingual NLU (XTREME): clear gains over XLM-R on classification and question-answering tasks.
- Embeddings (MTEB v2): strong English and multilingual scores for an encoder of its size.
- Code retrieval (CoIR): benefits from the code data included in pretraining.

Low-resource language handling

The annealed language schedule ensures that low-resource languages receive increasing emphasis in later training; most of the 1,833 languages are introduced only during the final decay phase, yet the model still learns them. On benchmarks for severely low-resource languages, such as Faroese FoQA and Tigrinya TiQuAD, mmBERT outperforms much larger decoder models such as OpenAI's o3 and Google's Gemini 2.5 Pro. This demonstrates that carefully trained encoder models can generalize effectively even for languages with minimal data.

Efficiency and practical gains

mmBERT achieves 2–4× speedups over XLM-R and MiniLM while supporting 8,192-token inputs, and it remains faster at 8,192 tokens than those older encoders were at 512 tokens. The gains come from the ModernBERT training recipe: FlashAttention 2, unpadded (sequence-packed) inputs, and alternating global and local attention, which together enable long-context handling without sacrificing inference throughput.
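A rough way to check throughput on your own hardware is to time forward passes at different sequence lengths. The snippet below is a simple benchmark sketch with an assumed model id and an arbitrary batch size, not the harness behind the published numbers; results vary heavily with hardware.

```python
# Rough throughput check: tokens/second at several sequence lengths.
# Model id is assumed; batch size and token ids are arbitrary.
import time
import torch
from transformers import AutoModel

model_id = "jhu-clsp/mmBERT-base"
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModel.from_pretrained(model_id).to(device).eval()

for seq_len in (512, 2048, 8192):
    batch = {
        "input_ids": torch.randint(0, 1000, (8, seq_len), device=device),
        "attention_mask": torch.ones(8, seq_len, dtype=torch.long, device=device),
    }
    with torch.no_grad():
        model(**batch)                      # warm-up pass
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        model(**batch)                      # timed pass
        if device == "cuda":
            torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    print(f"seq_len={seq_len}: {8 * seq_len / elapsed:,.0f} tokens/s")
```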

Practical takeaways

mmBERT provides an open, efficient, and scalable replacement for older multilingual encoders. With a training recipe spanning 3 trillion tokens, annealed language schedules, inverse masking, and model merging, it delivers broad generalization across high-resource and low-resource languages while substantially improving inference speed and context length.