Meta AI Unveils Multi-SpatialMLLM for Advanced Multi-Frame Spatial Reasoning in Multi-Modal LLMs

Meta AI presents Multi-SpatialMLLM, a new model improving multi-frame spatial understanding in multi-modal large language models, supported by the extensive MultiSPA dataset and benchmark.

Addressing Spatial Reasoning Limitations in Multi-Modal Language Models

Multi-modal large language models (MLLMs) have advanced as versatile AI assistants capable of processing various visual tasks. However, their deployment as isolated digital systems limits their real-world applicability, especially in fields like robotics and autonomous vehicles where complex spatial reasoning is critical. Current MLLMs often struggle with fundamental spatial tasks such as differentiating left from right.

Moving Beyond Single-Image Analysis

Previous work has tackled these issues by incorporating spatial data into training, but it has focused mainly on single-image scenarios. As a result, models struggle to perceive dynamic environments and to reason about spatial relationships across multiple frames.

Introducing Multi-SpatialMLLM and MultiSPA Dataset

Researchers from FAIR Meta and the Chinese University of Hong Kong have proposed a new framework to enhance MLLMs with robust multi-frame spatial understanding. This framework integrates three key components: depth perception, visual correspondence, and dynamic perception. Central to this effort is MultiSPA, a large-scale dataset containing over 27 million samples drawn from diverse 3D and 4D scenes.

Comprehensive Multi-Frame Spatial Tasks

MultiSPA generates training data across five tasks: depth perception, visual correspondence, camera movement perception, object movement perception, and object size perception. The Multi-SpatialMLLM model leverages this data to perform scalable and generalizable multi-frame reasoning, as sketched below.
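To make the data-generation setup concrete, here is a minimal Python sketch of how multi-frame QA samples for these five tasks might be represented. The class, field, and function names are illustrative assumptions, not taken from the MultiSPA release.

```python
# Hypothetical sketch of a MultiSPA-style multi-frame QA sample, grouped by task.
# Names and fields are illustrative, not the released schema.
from dataclasses import dataclass

TASKS = (
    "depth_perception",
    "visual_correspondence",
    "camera_movement_perception",
    "object_movement_perception",
    "object_size_perception",
)

@dataclass
class SpatialQASample:
    task: str               # one of TASKS
    frame_paths: list[str]  # two or more frames from the same 3D/4D scene
    question: str           # templated question referencing the frames
    answer: str             # qualitative label or quantitative value

def make_sample(task: str, frame_paths: list[str],
                question: str, answer: str) -> SpatialQASample:
    """Validate the task name and wrap one multi-frame QA pair."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task}")
    if len(frame_paths) < 2:
        raise ValueError("multi-frame tasks need at least two frames")
    return SpatialQASample(task, frame_paths, question, answer)
```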

Data Format and Benchmarking

MultiSPA follows a standard MLLM fine-tuning format based on QA pairs, with GPT-4o generating diverse templates for task descriptions, questions, and answers. High-quality annotated datasets such as Aria Digital Twin, Panoptic Studio, TAPVid3D, and ScanNet contribute to the dataset. The benchmark contains 7,800 samples across subtasks.
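As an illustration of the QA-pair fine-tuning format described above, the following is a hedged sketch of a single record in a chat-style conversation schema. This schema is an assumption modeled on common MLLM fine-tuning layouts; the actual MultiSPA field names and GPT-4o-generated templates may differ.

```python
# Minimal sketch of one QA-pair fine-tuning record in a chat-style format.
# The keys and example file paths are assumptions for illustration only.
import json

record = {
    "images": ["scene_0001/frame_00.jpg", "scene_0001/frame_10.jpg"],
    "conversations": [
        {
            "role": "user",
            "content": "Given the two frames, did the camera move closer "
                       "to the table or farther away?",
        },
        {
            "role": "assistant",
            "content": "The camera moved closer to the table.",
        },
    ],
    "task": "camera_movement_perception",
}

print(json.dumps(record, indent=2))
```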

Performance Gains and Generalization

On the MultiSPA benchmark, Multi-SpatialMLLM records an average 36% improvement over baseline models, achieving 80-90% accuracy on qualitative tasks compared to 50% for baselines. It also outperforms proprietary systems. Notably, it attains 18% accuracy on challenging camera movement prediction tasks where baselines perform near zero. On BLINK, the model reaches nearly 90% accuracy with a 26.4% average gain over baselines. Standard VQA benchmarks confirm the model maintains general MLLM skills without overfitting.

Research Contributions and Applications

This work extends MLLM spatial reasoning to multi-frame contexts, filling a critical research gap. MultiSPA is the first large-scale dataset and benchmark for multi-frame spatial tasks. The experiments demonstrate Multi-SpatialMLLM’s effectiveness, scalability, and generalization. The research reveals benefits of multi-task learning and emergent spatial reasoning abilities, enabling new applications such as multi-frame reward annotation.

For more details, see the paper, project page, and GitHub repository.
