Meta AI Unveils V-JEPA 2: Advanced Open-Source Self-Supervised Models for Video Understanding and Robotic Planning
Meta AI launches V-JEPA 2, a powerful open-source self-supervised model trained on massive video data for advanced visual understanding and robotic planning, achieving state-of-the-art accuracy and efficiency.
Scalable Self-Supervised Learning from Massive Video Data
Meta AI has developed V-JEPA 2, a scalable and open-source world model trained on over 1 million hours of internet video combined with 1 million images. This model utilizes a visual mask denoising objective to reconstruct masked spatiotemporal patches in a latent space, focusing on predicting meaningful scene dynamics rather than raw pixel data.
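In JEPA-style training, the loss is computed between predicted and target representations of the masked patches, with the targets typically produced by an exponential-moving-average copy of the encoder. The sketch below is a minimal, assumed rendering of that idea; the `encoder`, `ema_encoder`, and `predictor` interfaces are hypothetical placeholders, not the released V-JEPA 2 code.

```python
import torch
import torch.nn.functional as F

def masked_latent_loss(encoder, ema_encoder, predictor, video, mask):
    """Illustrative mask-denoising step in latent space (not the official code).

    video: (B, T, C, H, W) clip of frames
    mask:  (B, N) boolean tensor over spatiotemporal patch positions, True = hidden
    """
    # The context encoder only sees the visible (unmasked) patches.
    context_tokens = encoder(video, mask=mask)

    # Targets come from an EMA copy of the encoder applied to the full clip,
    # detached so gradients only flow through the online encoder and predictor.
    with torch.no_grad():
        target_tokens = ema_encoder(video)               # (B, N, D)

    # The predictor fills in latent representations at the masked positions.
    predicted_tokens = predictor(context_tokens, mask)   # (num_masked, D) assumed

    # Regression in representation space rather than pixel space.
    return F.l1_loss(predicted_tokens, target_tokens[mask])
```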
To achieve this scale, Meta introduced several innovations:
- Data scaling: Created a large dataset (VideoMix22M) aggregated from public sources such as SSv2, Kinetics, HowTo100M, YT-Temporal-1B, and ImageNet.
- Model scaling: Expanded the encoder to over 1 billion parameters using the ViT-g architecture.
- Training schedule: Employed a progressive resolution strategy and extended pretraining to 252,000 iterations.
- Spatial-temporal augmentation: Trained on progressively longer and higher-resolution clips, up to 64 frames at 384×384 resolution (a schedule of this shape is sketched below).
These design choices resulted in an impressive 88.2% average accuracy across six benchmark tasks, surpassing previous models.
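To make the progressive-resolution idea concrete, here is a hedged sketch of what such a staged schedule could look like in code. Only the 252,000 total iterations and the final 64-frame, 384×384 clips come from the description above; the stage boundaries and intermediate clip sizes are assumptions.

```python
# Illustrative pretraining schedule: short, low-resolution clips early in
# training, longer and higher-resolution clips late. Stage boundaries and
# intermediate sizes are assumed; only the 252k total iterations and the
# final 64-frame 384x384 clips are taken from the article.
PRETRAIN_SCHEDULE = [
    # (start_iteration, frames_per_clip, spatial_resolution)
    (0,        16, 256),
    (150_000,  32, 256),
    (210_000,  64, 384),   # final stage: 64 frames at 384x384
]
TOTAL_ITERATIONS = 252_000

def clip_settings(step: int):
    """Return (frames, resolution) for the current training step."""
    frames, res = PRETRAIN_SCHEDULE[0][1], PRETRAIN_SCHEDULE[0][2]
    for start, f, r in PRETRAIN_SCHEDULE:
        if step >= start:
            frames, res = f, r
    return frames, res
```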
Enhanced Visual Understanding through Masked Representation Learning
V-JEPA 2 demonstrates strong capabilities in motion and appearance understanding. On the Something-Something v2 benchmark, it achieves a top-1 accuracy of 77.3%, outperforming models like InternVideo and VideoMAEv2. For appearance recognition, it competes with leading image-text pretrained models such as DINOv2 and PEcoreG. Attentive probe evaluations confirm that self-supervised learning alone can produce transferable and domain-agnostic visual features applicable across diverse classification tasks.
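An attentive probe of this kind keeps the encoder frozen and trains only a small cross-attention pooling layer plus a linear classifier on top of its patch tokens. The module below is a generic sketch of that setup, not Meta's evaluation code; the hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

class AttentiveProbe(nn.Module):
    """Generic attentive probe: a learned query cross-attends over frozen
    encoder tokens, and a linear head classifies the pooled result."""

    def __init__(self, dim: int, num_classes: int, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) frozen features from the video encoder
        q = self.query.expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)   # (B, 1, D)
        return self.head(pooled.squeeze(1))        # (B, num_classes)
```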
Temporal Reasoning and Video Question Answering
The model's temporal reasoning was tested by aligning the V-JEPA 2 encoder with a multimodal large language model and evaluating it on various video question-answering benchmarks. Despite no language supervision during pretraining, it achieved strong results:
- 84.0% on PerceptionTest
- 76.9% on TempCompass
- 44.5% on MVP
- 36.7% on TemporalBench
- 40.3% on TOMATO
These outcomes suggest that visual-language alignment need not be co-trained from the start: a pretrained video encoder can be aligned with a language model after the fact and still generalize well.
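In practice, this kind of post-hoc alignment usually amounts to projecting the frozen video encoder's tokens into the language model's embedding space and training the projector (and optionally the LLM) on video-text data. A minimal sketch, with all dimensions and module names assumed:

```python
import torch
import torch.nn as nn

class VideoToLLMAdapter(nn.Module):
    """Illustrative adapter: maps frozen video-encoder tokens into the LLM's
    token-embedding space so they can be prepended to the text prompt.
    The dimensions and two-layer MLP design are assumptions."""

    def __init__(self, video_dim: int = 1408, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(video_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, video_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, N, video_dim) from the frozen encoder
        return self.proj(video_tokens)  # (B, N, llm_dim) "visual tokens" for the LLM
```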
V-JEPA 2-AC: Robotic Planning with Latent World Models
A notable extension, V-JEPA 2-AC, is an action-conditioned variant fine-tuned on just 62 hours of unlabeled robot video from the DROID dataset. This 300M-parameter transformer uses block-causal attention and is trained with teacher-forcing and rollout objectives to predict future video embeddings conditioned on robot actions and poses.
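In outline, such an action-conditioned predictor maps the current latent state and the executed robot action to the next latent state; teacher-forcing starts each step from the ground-truth latent, while rollouts feed the model its own predictions. The function below is a generic sketch with assumed interfaces, not the V-JEPA 2-AC training code.

```python
import torch
import torch.nn.functional as F

def prediction_loss(predictor, latents, actions, teacher_forcing: bool = True):
    """Illustrative objective for an action-conditioned latent predictor.

    latents: (B, T, D) frozen-encoder embeddings of observed frames
    actions: (B, T-1, A) robot actions executed between consecutive frames
    With teacher_forcing=True each step starts from the true latent; with
    False the model rolls forward on its own predictions.
    """
    B, T, D = latents.shape
    state = latents[:, 0]
    loss = 0.0
    for t in range(T - 1):
        inp = latents[:, t] if teacher_forcing else state
        state = predictor(inp, actions[:, t])              # predict next latent state
        loss = loss + F.l1_loss(state, latents[:, t + 1])  # match the real next frame
    return loss / (T - 1)
```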
This setup enables zero-shot planning through model-predictive control by minimizing the distance between predicted future states and visual goals using the Cross-Entropy Method (CEM). V-JEPA 2-AC achieves high success rates in reaching, grasping, and pick-and-place tasks on novel robots without reward supervision or extra data collection.
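A minimal sketch of that planning loop: the Cross-Entropy Method samples candidate action sequences, rolls them through the latent predictor, scores them by distance to the encoded goal image, and refits the sampling distribution to the best candidates. The sample counts, elite fraction, and predictor interface below are assumptions.

```python
import torch

def cem_plan(predictor, current_latent, goal_latent, horizon=8, action_dim=7,
             samples=256, elites=32, iterations=10):
    """Illustrative CEM planner over action sequences (not the official code).
    current_latent, goal_latent: (D,) embeddings of the current and goal images."""
    mean = torch.zeros(horizon, action_dim)
    std = torch.ones(horizon, action_dim)
    for _ in range(iterations):
        # Sample candidate action sequences from the current Gaussian.
        candidates = mean + std * torch.randn(samples, horizon, action_dim)
        # Roll each candidate forward through the latent world model.
        state = current_latent.unsqueeze(0).expand(samples, -1)
        for t in range(horizon):
            state = predictor(state, candidates[:, t])
        # Cost: distance between the predicted final state and the goal embedding.
        cost = torch.norm(state - goal_latent, dim=-1)
        # Refit the Gaussian to the lowest-cost (elite) sequences.
        elite_idx = cost.topk(elites, largest=False).indices
        elite_actions = candidates[elite_idx]
        mean, std = elite_actions.mean(dim=0), elite_actions.std(dim=0)
    return mean[0]  # execute only the first action, then re-plan (MPC-style)
```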
Benchmark Performance and Efficiency
Compared to baseline methods like Octo and Cosmos, V-JEPA 2-AC demonstrates:
- Much faster plan execution (~16 seconds per step versus 4 minutes for Cosmos).
- 100% success rate on reach tasks.
- Superior grasping and manipulation performance across various objects.
Remarkably, it operates using a monocular RGB camera without calibration or environment-specific tuning, highlighting its robust generalization.
Meta’s V-JEPA 2 showcases a breakthrough in leveraging massive passive video data for self-supervised learning, bridging perception and control for intelligent physical agents. The models and resources are openly available on Hugging Face and GitHub.
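For readers who want to try the released checkpoints, loading them through Hugging Face Transformers should look roughly like the snippet below; the repository id shown is a placeholder, so consult the official release pages for the actual model names.

```python
# Hedged example: the repository id is a placeholder, not a confirmed model name.
from transformers import AutoModel

model = AutoModel.from_pretrained("facebook/vjepa2-vit-giant")  # hypothetical id
model.eval()  # use the frozen encoder for downstream feature extraction
```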