NVIDIA Unveils Cosmos-Reason1: Revolutionizing Physical Common Sense and Embodied AI Reasoning
NVIDIA introduces Cosmos-Reason1, a suite of multimodal AI models designed to enhance physical common sense and embodied reasoning, guided by structured ontologies of space, time, and physics that improve how AI perceives and acts in real-world environments.
Bridging AI and the Physical World
AI has made remarkable progress in language, math, and code generation, yet understanding and interacting with physical environments remains a complex challenge. Physical AI aims to bridge this gap by creating systems that perceive sensory inputs like video and respond based on real-world physics. These systems are essential for navigation, manipulation, and interaction in dynamic settings, relying on common-sense reasoning about space, time, and physical laws.
Limitations of Current AI Models
Most existing AI models excel at abstract tasks but lack a grounded understanding of physical phenomena such as gravity or spatial relationships, which limits their reliability in embodied tasks. Training AI directly in physical environments is costly and risky, which has slowed progress. Fragmented tools and inconsistent evaluation frameworks have further hindered advances in physical reasoning.
Introducing Cosmos-Reason1
NVIDIA researchers have launched Cosmos-Reason1, a suite of multimodal large language models specifically designed for physical reasoning. The two models, Cosmos-Reason1-7B and Cosmos-Reason1-56B, undergo dual-phase training: Physical AI Supervised Fine-Tuning (SFT) and Physical AI Reinforcement Learning (RL).
Dual-Ontology Framework
A key innovation is the dual-ontology system guiding training and evaluation. One hierarchical ontology categorizes physical common sense into Space, Time, and Fundamental Physics, broken down into 16 subcategories. The second ontology maps reasoning capabilities across five embodied agents, including humans, robot arms, humanoid robots, and autonomous vehicles. This structure creates a standardized framework for benchmarking AI’s physical reasoning skills.
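To make the structure concrete, the dual ontology maps naturally onto a simple data structure. The sketch below is a hypothetical Python rendering: the three top-level categories and the agent types come from the article, while the subcategory names are illustrative placeholders standing in for the paper's 16 leaves.

```python
# Hypothetical sketch of the dual-ontology structure in plain Python.
# The three top-level categories match the article; the subcategory
# names are illustrative placeholders, not the paper's exact taxonomy,
# chosen only so the leaf count matches the stated 16.

PHYSICAL_COMMON_SENSE = {
    "Space": [
        "relationship", "plausibility", "affordance", "environment",
    ],
    "Time": [
        "actions", "order", "causality", "camera", "planning",
    ],
    "Fundamental Physics": [
        "attributes", "states", "object permanence", "mechanics",
        "electromagnetism", "thermodynamics", "anti-physics",
    ],
}

# Four of the five agent types the article names for the second,
# embodied-reasoning ontology (its "including" list is not exhaustive).
EMBODIED_AGENTS = ["human", "robot arm", "humanoid robot", "autonomous vehicle"]


def count_leaves(ontology: dict) -> int:
    """Total number of leaf subcategories across all categories."""
    return sum(len(subcategories) for subcategories in ontology.values())


assert count_leaves(PHYSICAL_COMMON_SENSE) == 16  # 4 + 5 + 7
```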
Multimodal Architecture and Training Data
Cosmos-Reason1 combines a decoder-only large language model with a vision encoder, processing video features alongside language tokens in a shared space. Training involves a vast dataset of around 4 million annotated video-text pairs featuring action descriptions, multiple-choice questions, and chain-of-thought reasoning. Reinforcement learning optimizes performance using rule-based rewards derived from human annotations and self-supervised video tasks like predicting video temporal direction and solving spatiotemporal puzzles.
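The appeal of the self-supervised RL tasks is that their labels come for free from how each clip is sampled, with no human annotation required. Below is a minimal sketch of what such rule-based rewards could look like; the function names and reward values are assumptions for illustration, not NVIDIA's implementation.

```python
import random

# Hypothetical rule-based rewards for the two self-supervised tasks
# mentioned above. Ground truth is generated by the data pipeline
# itself (reversing a clip, shuffling patches), so no human labels
# are needed; NVIDIA's exact reward scheme may differ.


def temporal_direction_reward(model_answer: str, clip_reversed: bool) -> float:
    """1.0 if the model correctly says whether the clip plays forward
    or backward, else 0.0."""
    truth = "backward" if clip_reversed else "forward"
    return 1.0 if model_answer.strip().lower() == truth else 0.0


def puzzle_reward(predicted_order: list, true_order: list) -> float:
    """Fraction of spatiotemporal patches placed in the correct slot."""
    matches = sum(p == t for p, t in zip(predicted_order, true_order))
    return matches / len(true_order)


# Example: the "free" label is fixed at sampling time by deciding
# whether to reverse the clip, then scoring the model's answer.
clip_reversed = random.random() < 0.5
print(temporal_direction_reward("forward", clip_reversed))
```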
Benchmarking and Performance
The team developed benchmarks for physical common sense (604 questions from 426 videos) and embodied reasoning (610 questions from 600 videos). Cosmos-Reason1 models outperformed previous baselines, especially after reinforcement learning, excelling in task completion verification, next-action prediction, and assessing physical feasibility. The larger Cosmos-Reason1-56B model showed stronger performance across metrics.
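Because both benchmarks are built from multiple-choice questions, scoring largely reduces to exact-match accuracy over the answer options. The snippet below is a generic sketch of that computation, not NVIDIA's evaluation harness; the answer format is an assumption.

```python
# Generic multiple-choice accuracy, as a stand-in for how scores on
# the 604- and 610-question benchmarks might be computed; the
# option-letter answer format is assumed for illustration.

def accuracy(predictions: list, gold: list) -> float:
    """Fraction of questions where the chosen option matches gold."""
    assert len(predictions) == len(gold)
    correct = sum(
        p.strip().upper() == g.strip().upper()
        for p, g in zip(predictions, gold)
    )
    return correct / len(gold)


# e.g. accuracy(["A", "C", "B"], ["A", "B", "B"]) == 2 / 3
```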
Implications and Applications
These advancements pave the way for more capable AI in robotics, autonomous driving, and human-machine collaboration, where real-time perception and physical reasoning are critical. By integrating structured ontologies and multimodal data, Cosmos-Reason1 represents a significant leap toward AI systems that understand and interact with the physical world effectively.
For further details, explore the Paper, Project Page, Models on Hugging Face, and GitHub repository.