PHYX Benchmark Exposes Physical Reasoning Gaps in Multimodal AI Models
The PHYX benchmark uncovers key weaknesses in current multimodal AI models’ ability to perform physical reasoning, emphasizing the challenge of integrating visual data with symbolic and causal knowledge.
Advances and Limitations of Multimodal Foundation Models
Recent multimodal foundation models have achieved human-competitive accuracy on demanding mathematical and disciplinary-knowledge benchmarks such as AIME, GPQA, MATH-500, and OlympiadBench. Yet these evaluations largely overlook a critical dimension of machine intelligence: physical reasoning, which requires integrating disciplinary knowledge, symbolic operations, and real-world constraints, and is therefore sharply distinct from pure mathematical problem-solving.
The Challenge of Physical Reasoning
Physical problem-solving requires models to interpret implicit conditions, such as recognizing that a “smooth surface” implies zero friction, and to maintain physical consistency across reasoning steps, since physical laws hold regardless of the solution path. This complexity raises the question of whether current multimodal large language models (MLLMs) possess genuinely advanced reasoning abilities in visual and physical domains closer to real-world scenarios.
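As a concrete illustration (a made-up example in the spirit of the benchmark, not an actual PHYX item): a block released on a “smooth” incline of angle θ requires the solver to infer that the friction coefficient is zero before applying Newton’s second law along the incline.

```latex
% Hypothetical PHYX-style item: block of mass m on a "smooth" incline of angle \theta.
% "Smooth" is an implicit condition meaning \mu = 0, so the friction term vanishes.
ma = mg\sin\theta - \underbrace{\mu m g\cos\theta}_{=\,0}
\qquad\Longrightarrow\qquad
a = g\sin\theta
```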
Existing Benchmarks and Their Limits
While benchmarks such as PhysReason and EMMA include multimodal physics problems, they cover only small subsets of physics, which is insufficient for thoroughly assessing MLLMs’ advanced physics reasoning capabilities. Recognizing this gap, a team of researchers from leading universities introduced PHYX, a comprehensive benchmark dedicated to evaluating physical reasoning in foundation models.
Introducing PHYX: A New Benchmark for Physical Reasoning
PHYX consists of 3,000 visually grounded physics questions spanning six core physics domains: Mechanics, Electromagnetism, Thermodynamics, Wave/Acoustics, Optics, and Modern Physics. It emphasizes multimodal problem-solving through three key innovations:
- A large dataset of newly collected questions depicting realistic physical scenarios that require integrated visual and causal reasoning.
- Expert-validated question design ensuring coverage across fundamental physics disciplines.
- A strict, unified three-step evaluation protocol to rigorously assess reasoning capabilities (a rough sketch of such a scoring pipeline follows below).
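The exact protocol is not spelled out in this summary. Purely as an illustration, a unified scoring pipeline for free-form answers to a PHYX-style benchmark might extract the model’s final answer, normalize it, and compare it against the reference; the function names, the “Final answer:” convention, and the exact-match rule below are assumptions, not the benchmark’s published procedure.

```python
import re

# Hypothetical scoring pipeline for a PHYX-style benchmark. Function names,
# the "Final answer:" convention, and exact-match scoring are illustrative
# assumptions, not the authors' published protocol.

def extract_answer(response: str) -> str:
    """Pull a final answer out of a model's free-form reasoning text."""
    match = re.search(r"final answer:\s*(.+)", response, flags=re.IGNORECASE)
    return match.group(1).strip() if match else response.strip().splitlines()[-1]

def normalize(answer: str) -> str:
    """Strip case and whitespace so superficial formatting differences don't count."""
    return "".join(answer.lower().split())

def is_correct(response: str, ground_truth: str) -> bool:
    """Compare the extracted, normalized answer against the reference answer."""
    return normalize(extract_answer(response)) == normalize(ground_truth)

# Toy usage with made-up items:
items = [
    {"response": "The incline is smooth, so a = g sin(30 deg).\nFinal answer: 4.9 m/s^2",
     "answer": "4.9 m/s^2"},
    {"response": "Final answer: 3.2 m/s^2", "answer": "4.9 m/s^2"},
]
accuracy = sum(is_correct(x["response"], x["answer"]) for x in items) / len(items)
print(f"accuracy = {accuracy:.2f}")  # 0.50 on this toy set
```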
Data Collection and Quality Control
The data collection involved a four-stage process beginning with a comprehensive survey of core physics disciplines to ensure diverse coverage. STEM graduate students acted as expert annotators. To maintain originality and quality, questions without readily available answers were selected, and a three-stage cleaning process was applied, including duplicate detection via lexical overlap and manual review by physics Ph.D. students. The shortest 10% of questions by length were filtered out, resulting in 3,000 high-quality questions from an initial 3,300.
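The sketch below shows the kind of cleaning this describes: near-duplicate removal via lexical overlap, followed by dropping the shortest questions by length. The 0.8 similarity threshold, word-level Jaccard overlap, and toy data are assumptions; the authors’ exact settings may differ.

```python
# Illustrative cleaning pass: near-duplicate detection via lexical overlap,
# then removal of the shortest questions by length. Thresholds and tokenization
# are assumptions, not the authors' exact configuration.

def lexical_overlap(q1: str, q2: str) -> float:
    """Jaccard overlap between the word sets of two questions."""
    a, b = set(q1.lower().split()), set(q2.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0

def deduplicate(questions: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a question only if it is not too similar to any already-kept one."""
    kept: list[str] = []
    for q in questions:
        if all(lexical_overlap(q, k) < threshold for k in kept):
            kept.append(q)
    return kept

def drop_shortest(questions: list[str], fraction: float = 0.10) -> list[str]:
    """Remove the shortest `fraction` of questions by character length."""
    cutoff = int(len(questions) * fraction)
    shortest = set(sorted(questions, key=len)[:cutoff])
    return [q for q in questions if q not in shortest]

# Toy usage (fraction raised to 0.5 so the effect is visible on three items;
# the paper filters the shortest 10%):
raw = [
    "A block rests on a smooth incline of angle 30 degrees. Find its acceleration.",
    "A block rests on a smooth incline of angle 30 degrees. Find the acceleration.",
    "Find v.",
]
print(drop_shortest(deduplicate(raw), fraction=0.5))
```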
Benchmark Results and Implications
PHYX poses a significant challenge: even the lowest-performing human experts scored 75.6%, outperforming all tested models. The benchmark reveals a clear performance gap between humans and AI models, especially on open-ended questions that demand genuine reasoning rather than the surface-level cues available in multiple-choice formats. GPT-4o’s accuracy on PHYX trails its performance on MathVista and MATH-V, illustrating the greater difficulty of physical reasoning, which requires deeper integration of abstract concepts and real-world knowledge.
Key Insights and Future Directions
PHYX highlights that current state-of-the-art multimodal models often depend on memorized knowledge, mathematical formulas, and superficial visual patterns rather than true understanding of physical principles. The benchmark is currently limited to English prompts and schematic images, which may not fully capture the complexities of real-world perception and multilingual reasoning. Nevertheless, PHYX sets a new standard for evaluating physical reasoning in AI, encouraging further research into bridging these capability gaps.
For more details, explore the Paper, Code, and Project Page.