Meta AI Unveils V-JEPA 2: Advanced Open-Source Self-Supervised Models for Video Understanding and Robotic Planning
Meta AI launches V-JEPA 2, a powerful open-source self-supervised model trained on massive video data for advanced visual understanding and robotic planning, achieving state-of-the-art accuracy and efficiency.
Scalable Self-Supervised Learning from Massive Video Data
Meta AI has developed V-JEPA 2, a scalable and open-source world model trained on over 1 million hours of internet video combined with 1 million images. This model utilizes a visual mask denoising objective to reconstruct masked spatiotemporal patches in a latent space, focusing on predicting meaningful scene dynamics rather than raw pixel data.
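In JEPA-style training, the loss is computed between predicted and target representations of the masked patches, with the targets typically produced by an exponential-moving-average copy of the encoder. The sketch below is a minimal, assumed rendering of that idea; the `encoder`, `ema_encoder`, and `predictor` interfaces are hypothetical placeholders, not the released V-JEPA 2 code.

```python
import torch
import torch.nn.functional as F

def masked_latent_loss(encoder, ema_encoder, predictor, video, mask):
    """Illustrative mask-denoising step in latent space (not the official code).

    video: (B, T, C, H, W) clip of frames
    mask:  (B, N) boolean tensor over spatiotemporal patch positions, True = hidden
    """
    # The context encoder only sees the visible (unmasked) patches.
    context_tokens = encoder(video, mask=mask)

    # Targets come from an EMA copy of the encoder applied to the full clip,
    # detached so gradients only flow through the online encoder and predictor.
    with torch.no_grad():
        target_tokens = ema_encoder(video)               # (B, N, D)

    # The predictor fills in latent representations at the masked positions.
    predicted_tokens = predictor(context_tokens, mask)   # (num_masked, D) assumed

    # Regression in representation space rather than pixel space.
    return F.l1_loss(predicted_tokens, target_tokens[mask])
```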
To achieve this scale, Meta introduced several innovations:
- Data scaling: Created a large dataset (VideoMix22M) aggregated from public sources such as SSv2, Kinetics, HowTo100M, YT-Temporal-1B, and ImageNet.
- Model scaling: Expanded the encoder to over 1 billion parameters using the ViT-g architecture.
- Training schedule: Employed a progressive resolution strategy and extended pretraining to 252,000 iterations.
- Spatial-temporal augmentation: Trained on progressively longer and higher-resolution clips, up to 64 frames at 384×384 resolution (a schedule of this shape is sketched below).
These design choices resulted in an impressive 88.2% average accuracy across six benchmark tasks, surpassing previous models.
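To make the progressive-resolution idea concrete, here is a hedged sketch of what such a staged schedule could look like in code. Only the 252,000 total iterations and the final 64-frame, 384×384 clips come from the description above; the stage boundaries and intermediate clip sizes are assumptions.

```python
# Illustrative pretraining schedule: short, low-resolution clips early in
# training, longer and higher-resolution clips late. Stage boundaries and
# intermediate sizes are assumed; only the 252k total iterations and the
# final 64-frame 384x384 clips are taken from the article.
PRETRAIN_SCHEDULE = [
    # (start_iteration, frames_per_clip, spatial_resolution)
    (0,        16, 256),
    (150_000,  32, 256),
    (210_000,  64, 384),   # final stage: 64 frames at 384x384
]
TOTAL_ITERATIONS = 252_000

def clip_settings(step: int):
    """Return (frames, resolution) for the current training step."""
    frames, res = PRETRAIN_SCHEDULE[0][1], PRETRAIN_SCHEDULE[0][2]
    for start, f, r in PRETRAIN_SCHEDULE:
        if step >= start:
            frames, res = f, r
    return frames, res
```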
Enhanced Visual Understanding through Masked Representation Learning
V-JEPA 2 demonstrates strong capabilities in motion and appearance understanding. On the Something-Something v2 benchmark, it achieves a top-1 accuracy of 77.3%, outperforming models like InternVideo and VideoMAEv2. For appearance recognition, it competes with leading image-text pretrained models such as DINOv2 and PEcoreG. Attentive probe evaluations confirm that self-supervised learning alone can produce transferable and domain-agnostic visual features applicable across diverse classification tasks.
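An attentive probe of this kind keeps the encoder frozen and trains only a small cross-attention pooling layer plus a linear classifier on top of its patch tokens. The module below is a generic sketch of that setup, not Meta's evaluation code; the hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

class AttentiveProbe(nn.Module):
    """Generic attentive probe: a learned query cross-attends over frozen
    encoder tokens, and a linear head classifies the pooled result."""

    def __init__(self, dim: int, num_classes: int, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) frozen features from the video encoder
        q = self.query.expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)   # (B, 1, D)
        return self.head(pooled.squeeze(1))        # (B, num_classes)
```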
Temporal Reasoning and Video Question Answering
The model's temporal reasoning was tested by aligning the V-JEPA 2 encoder with a multimodal large language model and evaluating it on various video question-answering benchmarks. Despite no language supervision during pretraining, it achieved strong results:
- 84.0% on PerceptionTest
- 76.9% on TempCompass
- 44.5% on MVP
- 36.7% on TemporalBench
- 40.3% on TOMATO
These outcomes suggest that visual-language alignment need not be co-trained from the start: a pretrained video encoder can be aligned with a language model after the fact and still generalize well.
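In practice, this kind of post-hoc alignment usually amounts to projecting the frozen video encoder's tokens into the language model's embedding space and training the projector (and optionally the LLM) on video-text data. A minimal sketch, with all dimensions and module names assumed:

```python
import torch
import torch.nn as nn

class VideoToLLMAdapter(nn.Module):
    """Illustrative adapter: maps frozen video-encoder tokens into the LLM's
    token-embedding space so they can be prepended to the text prompt.
    The dimensions and two-layer MLP design are assumptions."""

    def __init__(self, video_dim: int = 1408, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(video_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, video_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, N, video_dim) from the frozen encoder
        return self.proj(video_tokens)  # (B, N, llm_dim) "visual tokens" for the LLM
```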
V-JEPA 2-AC: Robotic Planning with Latent World Models
A notable extension, V-JEPA 2-AC, is an action-conditioned variant fine-tuned on just 62 hours of unlabeled robot video from the DROID dataset. This 300M-parameter transformer uses block-causal attention and is trained with teacher-forcing and rollout objectives to predict future video embeddings conditioned on robot actions and poses.
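In outline, such an action-conditioned predictor maps the current latent state and the executed robot action to the next latent state; teacher-forcing starts each step from the ground-truth latent, while rollouts feed the model its own predictions. The function below is a generic sketch with assumed interfaces, not the V-JEPA 2-AC training code.

```python
import torch
import torch.nn.functional as F

def prediction_loss(predictor, latents, actions, teacher_forcing: bool = True):
    """Illustrative objective for an action-conditioned latent predictor.

    latents: (B, T, D) frozen-encoder embeddings of observed frames
    actions: (B, T-1, A) robot actions executed between consecutive frames
    With teacher_forcing=True each step starts from the true latent; with
    False the model rolls forward on its own predictions.
    """
    B, T, D = latents.shape
    state = latents[:, 0]
    loss = 0.0
    for t in range(T - 1):
        inp = latents[:, t] if teacher_forcing else state
        state = predictor(inp, actions[:, t])              # predict next latent state
        loss = loss + F.l1_loss(state, latents[:, t + 1])  # match the real next frame
    return loss / (T - 1)
```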
This setup enables zero-shot planning through model-predictive control by minimizing the distance between predicted future states and visual goals using the Cross-Entropy Method (CEM). V-JEPA 2-AC achieves high success rates in reaching, grasping, and pick-and-place tasks on novel robots without reward supervision or extra data collection.
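A minimal sketch of that planning loop: the Cross-Entropy Method samples candidate action sequences, rolls them through the latent predictor, scores them by distance to the encoded goal image, and refits the sampling distribution to the best candidates. The sample counts, elite fraction, and predictor interface below are assumptions.

```python
import torch

def cem_plan(predictor, current_latent, goal_latent, horizon=8, action_dim=7,
             samples=256, elites=32, iterations=10):
    """Illustrative CEM planner over action sequences (not the official code).
    current_latent, goal_latent: (D,) embeddings of the current and goal images."""
    mean = torch.zeros(horizon, action_dim)
    std = torch.ones(horizon, action_dim)
    for _ in range(iterations):
        # Sample candidate action sequences from the current Gaussian.
        candidates = mean + std * torch.randn(samples, horizon, action_dim)
        # Roll each candidate forward through the latent world model.
        state = current_latent.unsqueeze(0).expand(samples, -1)
        for t in range(horizon):
            state = predictor(state, candidates[:, t])
        # Cost: distance between the predicted final state and the goal embedding.
        cost = torch.norm(state - goal_latent, dim=-1)
        # Refit the Gaussian to the lowest-cost (elite) sequences.
        elite_idx = cost.topk(elites, largest=False).indices
        elite_actions = candidates[elite_idx]
        mean, std = elite_actions.mean(dim=0), elite_actions.std(dim=0)
    return mean[0]  # execute only the first action, then re-plan (MPC-style)
```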
Benchmark Performance and Efficiency
Compared to baseline methods like Octo and Cosmos, V-JEPA 2-AC demonstrates:
- Much faster plan execution (~16 seconds per step versus 4 minutes for Cosmos).
- 100% success rate on reach tasks.
- Superior grasping and manipulation performance across various objects.
Remarkably, it operates using a monocular RGB camera without calibration or environment-specific tuning, highlighting its robust generalization.
Meta’s V-JEPA 2 showcases a breakthrough in leveraging massive passive video data for self-supervised learning, bridging perception and control for intelligent physical agents. The models and resources are openly available on Hugging Face and GitHub.
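For readers who want to try the released checkpoints, loading them through Hugging Face Transformers should look roughly like the snippet below; the repository id shown is a placeholder, so consult the official release pages for the actual model names.

```python
# Hedged example: the repository id is a placeholder, not a confirmed model name.
from transformers import AutoModel

model = AutoModel.from_pretrained("facebook/vjepa2-vit-giant")  # hypothetical id
model.eval()  # use the frozen encoder for downstream feature extraction
```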