
Meta CLIP 2: Breaking the Curse of Multilinguality with Worldwide Image-Text Pretraining

Meta CLIP 2 trains CLIP models from scratch on native worldwide image-text pairs and shows that scaling metadata, curation, and model capacity reduces the curse of multilinguality while improving zero-shot and multilingual benchmark results.

Background

Contrastive Language-Image Pretraining (CLIP) has become a foundation for modern vision and multimodal systems, powering zero-shot image classification and serving as a vision encoder for multimodal large language models. Most CLIP variants, however, are trained on English-centric image-text pairs, leaving extensive non-English content on the web underused and unrepresented.
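To make the zero-shot setup concrete, the sketch below classifies an image by comparing its embedding against embeddings of prompt-style label texts. It is a minimal illustration assuming the Hugging Face transformers library and the public OpenAI ViT-L/14 checkpoint; a Meta CLIP 2 checkpoint would be used the same way, but no specific model identifier is implied here, and photo.jpg is a placeholder file name.

```python
# Minimal sketch of CLIP-style zero-shot classification, assuming the
# Hugging Face `transformers` library and the public OpenAI ViT-L/14 checkpoint.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "openai/clip-vit-large-patch14"  # public OpenAI checkpoint
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("photo.jpg")                      # placeholder local image
labels = ["a photo of a cat", "a photo of a dog"]    # prompt-style class names

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity logits -> probabilities over the label set.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

The label with the highest probability is the zero-shot prediction; no task-specific training is involved.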

Key challenges

Two main obstacles block straightforward multilingual scaling. First, there is no efficient, scalable pipeline for curating high-quality non-English image-text pairs across the global web. Second, adding multilingual data often reduces English performance, the so-called curse of multilinguality, forcing trade-offs between languages rather than enabling a unified model that excels across them.

What Meta CLIP 2 proposes

Meta CLIP 2, developed by teams at Meta, MIT, Princeton, and NYU, is the first CLIP-style model trained from scratch on native worldwide image-text pairs without relying on private datasets, machine translation, or distillation from English teacher models. The approach eliminates the trade-off between English and non-English performance by jointly scaling four elements: metadata, curation, model capacity, and the training regimen.

Core innovations

  • Scalable multilingual metadata spanning 300+ languages to capture global content diversity.
  • A per-language curation algorithm designed to balance concept distribution across languages rather than letting English dominate (a simplified sketch follows this list).
  • An advanced training framework that extends OpenAI CLIP and Meta CLIP training settings with multilingual support, a multilingual tokenizer, and scaled seen-pair counts.
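The per-language curation step is only described at a high level above; the following is a minimal sketch of the balancing idea under simple assumptions: alt-texts are matched to metadata entries by substring lookup, and matches within any single (language, entry) bucket are capped so that head concepts do not crowd out tail concepts. The function name curate_per_language, the cap value, and the data layout are illustrative, not the released pipeline.

```python
import random
from collections import defaultdict

def curate_per_language(pairs, metadata_by_lang, cap=20_000, seed=0):
    """Illustrative per-language concept balancing (not the released pipeline).

    pairs:            list of (lang, alt_text, image_ref) tuples
    metadata_by_lang: dict mapping language code -> iterable of metadata entries
    cap:              maximum pairs kept per (language, entry) bucket, limiting
                      head concepts so tail concepts are preserved
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)  # (lang, entry) -> matching pairs

    for lang, alt_text, image_ref in pairs:
        text = alt_text.lower()
        for entry in metadata_by_lang.get(lang, ()):
            if entry in text:  # simple substring match stands in for real matching
                buckets[(lang, entry)].append((lang, alt_text, image_ref))

    curated = {}
    for (lang, entry), matched in buckets.items():
        if len(matched) > cap:
            matched = rng.sample(matched, cap)  # down-sample head concepts
        for pair in matched:
            curated.setdefault(pair, None)  # dedupe pairs matched by several entries
    return list(curated)
```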

Training setup and model choices

To maximize generalizability to existing CLIP variants, Meta CLIP 2 follows OpenAI CLIP's architecture and training recipes, with two primary model scales analyzed: ViT-L/14 and ViT-H/14. Experiments indicate that capacity matters: ViT-L/14 struggles with the curse of multilinguality when multilingual data are added, while ViT-H/14 marks an inflection point, achieving gains on both English and non-English tasks when trained on scaled worldwide pairs.
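Since the recipe follows OpenAI CLIP, the training objective is the standard symmetric contrastive loss over image-text pairs in a batch. Below is a minimal sketch assuming pre-computed encoder outputs and a learnable temperature (logit_scale); the names are illustrative.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, logit_scale):
    """Symmetric CLIP-style contrastive loss for a batch of aligned pairs.

    image_embeds, text_embeds: (batch, dim) tensors from the two encoders
    logit_scale:               learnable temperature (exp of a scalar parameter)
    """
    # L2-normalize so the dot product is cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds the true pairs.
    logits_per_image = logit_scale * image_embeds @ text_embeds.t()
    logits_per_text = logits_per_image.t()

    targets = torch.arange(image_embeds.size(0), device=image_embeds.device)
    loss_i = F.cross_entropy(logits_per_image, targets)  # image -> text
    loss_t = F.cross_entropy(logits_per_text, targets)   # text -> image
    return (loss_i + loss_t) / 2
```

Each image is pulled toward its paired caption and pushed away from the other captions in the batch, and symmetrically for each caption.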

Performance highlights

When trained at the ViT-H/14 scale on worldwide data with scaled seen pairs, Meta CLIP 2 surpasses its English-only and non-English-trained counterparts on both English and multilingual benchmarks. On zero-shot ImageNet, the ViT-H/14 model improves from 80.5% to 81.3% accuracy. Scaling from 13B English pairs to 29B worldwide pairs produces consistent improvements across multilingual benchmarks such as XM3600, Babel-ImageNet, and CVQA, with some saturation observed on geographically diverse benchmarks such as GeoDE.

Practical findings

  • Removing English filters on alt-texts can hurt English performance slightly (for example, a 0.6% drop on ImageNet), underscoring the influence of language isolation in dataset curation.
  • Replacing English-centric metadata with merged worldwide metadata may temporarily lower English performance but substantially improves multilingual capabilities when paired with larger model capacity and scaled training.
  • The curse of multilinguality can be mitigated by jointly scaling metadata, per-language curation, and model capacity rather than by relying on distillation or translation.

Open release and implications

Meta CLIP 2 open-sources its metadata, curation pipeline, and training code, enabling researchers to move beyond English-centric CLIP training. The recipe demonstrates that careful multilingual curation and sufficient model capacity allow a single unified CLIP model to perform strongly across languages, expanding the potential of worldwide multimodal web data for vision and multimodal research.

For details, consult the paper and accompanying GitHub resources for tutorials, code, and notebooks.
