Meta AI Unveils Web-SSL: Scaling Vision Learning Without Language

Meta AI introduces Web-SSL, a family of large-scale visual self-supervised models trained without language supervision. These models achieve competitive results on multimodal benchmarks, challenging the need for language in vision learning.

Challenging Language Dependency in Vision Models

Contrastive language-image models such as CLIP have dominated vision representation learning by leveraging large-scale paired image-text data. However, this reliance on paired text raises concerns about the cost of data acquisition, the scalability of curation, and whether language supervision is truly necessary for strong visual representations. Visual self-supervised learning (SSL) offers an alternative by learning from images alone, but it has seen limited use in multimodal reasoning because of performance gaps, especially on OCR and chart interpretation tasks.

Introducing Web-SSL Models

Meta has released the Web-SSL family of models, built on Vision Transformer (ViT) architectures and trained with DINO- and MAE-style self-supervision, scaling from 300 million to 7 billion parameters. These models are trained exclusively on the image subset of the MetaCLIP dataset (MC-2B), containing two billion images, enabling a direct comparison to CLIP trained on identical data but with language supervision. This release aims to rigorously evaluate the potential of pure visual self-supervision at scale.

Architecture and Training

Web-SSL employs two SSL paradigms: joint-embedding learning (DINOv2) and masked modeling (MAE). Models are trained on 224×224 images, and the vision encoders are kept frozen during downstream tasks to isolate the effect of pretraining. Training spans five model sizes (ViT-1B to ViT-7B) using unlabeled MC-2B images. Evaluation relies on Cambrian-1, a 16-task benchmark suite covering visual understanding, knowledge reasoning, OCR, and chart interpretation. The models are also integrated into Hugging Face's transformers library for easy access.
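Loading a checkpoint and extracting frozen features can be done directly through transformers. The sketch below is illustrative: the model ID is an assumed placeholder, not a confirmed checkpoint name, so substitute the actual ID from the Web-SSL release on the Hugging Face Hub.

```python
# Minimal sketch: extract frozen image features from a Web-SSL encoder via
# Hugging Face transformers. The model ID below is an assumed placeholder --
# replace it with an actual checkpoint name from the Web-SSL release.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

model_id = "facebook/webssl-dino-example"  # hypothetical checkpoint name

processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model.eval()  # encoder stays frozen, matching the evaluation protocol above

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Patch-token features; mean-pool them into a single image embedding.
features = outputs.last_hidden_state.mean(dim=1)
print(features.shape)  # (1, hidden_dim)
```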

Performance Highlights

  • Scaling Benefits: Web-SSL scales nearly log-linearly in VQA performance with model size, whereas CLIP plateaus beyond 3 billion parameters.
  • Data Composition: Training on only the roughly 1.3% of images that contain visual text lets Web-SSL surpass CLIP by up to 13.6% on OCR and chart tasks, underscoring that text visible in images matters more than language labels.
  • High-Resolution Training: Fine-tuning at 518px resolution narrows the gap with high-resolution models such as SigLIP on document tasks.
  • Emergent Language Alignment: Despite receiving no language supervision, larger Web-SSL models align better with pretrained language models (e.g., LLaMA-3), suggesting implicit semantic learning.
  • Robust Benchmark Results: Web-SSL performs strongly on classification (ImageNet-1k), segmentation (ADE20K), and depth estimation (NYUv2), often outperforming MetaCLIP and DINOv2 in comparable settings (see the frozen-feature linear-probe sketch after this list).
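The frozen-encoder protocol behind these comparisons amounts to training only a lightweight head on top of fixed Web-SSL features. Below is a minimal PyTorch sketch of such a linear probe, assuming features and labels have already been extracted offline (as in the earlier snippet); the hyperparameters are illustrative, not the paper's.

```python
# Minimal linear-probe sketch over frozen features. `features` is an (N, D) float
# tensor extracted offline with a frozen Web-SSL encoder (see the earlier snippet),
# and `labels` is an (N,) long tensor of class indices. Hyperparameters are illustrative.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_linear_probe(features: torch.Tensor, labels: torch.Tensor,
                       num_classes: int, epochs: int = 10, lr: float = 1e-3) -> nn.Module:
    """Train a single linear layer on top of frozen features; the encoder never updates."""
    probe = nn.Linear(features.shape[1], num_classes)
    optimizer = torch.optim.AdamW(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    loader = DataLoader(TensorDataset(features, labels), batch_size=256, shuffle=True)

    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(probe(x), y)
            loss.backward()
            optimizer.step()
    return probe
```

Because only the probe receives gradients, differences in downstream accuracy reflect the quality of the pretrained representation itself, which is the comparison the benchmark numbers above are making.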

Implications for Multimodal Learning

The Web-SSL release demonstrates that scalable visual self-supervised learning can rival and sometimes exceed language-supervised approaches. This challenges the assumption that language is essential for multimodal vision tasks and highlights dataset composition and scale as critical factors. By providing open-source models spanning a wide range of sizes without dependency on paired data, Meta encourages new research avenues in scalable, language-free vision representation learning.

Explore the Web-SSL models on Hugging Face, check the GitHub repository, and read the full paper for deeper insights.
