Researchers Introduce General-Level and General-Bench to Measure True Synergy in Multimodal AI
New research introduces General-Level and General-Bench to measure true synergy in multimodal AI models, revealing current systems lack full integration across tasks and modalities.
The Evolution of Multimodal AI
Artificial intelligence has expanded beyond language-only systems, evolving into models that can process multiple types of inputs such as text, images, audio, and video. This field, known as multimodal learning, seeks to emulate the natural human ability to integrate and interpret diverse sensory information. Unlike traditional AI systems that focus on a single modality, multimodal generalist models are developed to understand and respond across different data formats, aiming to replicate human-like cognition by combining various types of knowledge and perception.
The Challenge of True Generalization and Synergy
Despite progress, the key challenge lies in enabling these multimodal systems to demonstrate genuine generalization. Many current models can handle multiple input types but cannot transfer what they learn across different tasks or modalities. This cross-task enhancement, termed synergy, is precisely what current systems lack, and its absence impedes advances toward more intelligent and adaptive AI. A model might perform well separately at image classification and text generation, yet it cannot be considered a true generalist unless it can integrate skills from both domains. Achieving synergy is crucial for developing autonomous AI systems capable of complex understanding and reasoning.
Limitations of Current Approaches
Most existing multimodal AI tools rely heavily on large language models (LLMs) enhanced by external specialized modules for tasks like image recognition or speech processing. For example, models such as CLIP or Flamingo combine language and vision but lack deep integration between these modalities. Instead of operating as unified systems, they function with loosely connected components that simulate multimodal intelligence. This fragmented architecture prevents meaningful cross-modal learning, resulting in isolated task performance rather than comprehensive understanding.
Introducing General-Level and General-Bench
A collaborative team of researchers from the National University of Singapore (NUS), Nanyang Technological University (NTU), Zhejiang University (ZJU), Peking University (PKU), and others proposed a new framework called General-Level alongside a benchmark named General-Bench to evaluate and encourage synergy across modalities and tasks.
General-Level categorizes models into five levels based on their ability to integrate comprehension, generation, and language tasks effectively. General-Bench supports this framework with a vast dataset containing over 700 tasks and 325,800 annotated examples from text, images, audio, video, and 3D data.
Measuring Synergy Across Tasks and Modalities
General-Level assesses models not only on task performance but also on their capability to surpass state-of-the-art specialist systems by leveraging shared knowledge. Researchers identify three types of synergy: task-to-task, comprehension-generation, and modality-modality. Each level demands increasing synergy capabilities. For instance, Level-2 models support many modalities and tasks, while Level-4 models must demonstrate synergy between comprehension and generation. The scoring system is designed to minimize bias from dominant modalities and to promote balanced support across diverse tasks.
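The paper defines its own scoring formulas; as a rough illustration of the core idea only (not the authors' actual metric, and with hypothetical task names and scores), a synergy check can credit a generalist on a task only when it surpasses the specialist state of the art for that task:

```python
# Hypothetical sketch of synergy-aware scoring (NOT General-Level's
# actual formula): a generalist earns synergy credit on a task only
# when it beats the best specialist score for that task.

def synergy_score(generalist: dict, specialist_sota: dict) -> float:
    """Average relative gain over specialist SoTA, counting only tasks
    where the generalist actually surpasses the specialist."""
    gains = []
    for task, g_score in generalist.items():
        s_score = specialist_sota.get(task)
        if s_score is None or s_score == 0:
            continue  # no specialist baseline for this task
        # Negative gains (generalist below specialist) contribute zero.
        gains.append(max(0.0, (g_score - s_score) / s_score))
    return sum(gains) / len(gains) if gains else 0.0

# Illustrative numbers only: the generalist beats the specialist on
# "vqa" and "asr" but trails on "captioning".
generalist = {"vqa": 0.82, "captioning": 0.70, "asr": 0.55}
specialist = {"vqa": 0.78, "captioning": 0.75, "asr": 0.50}
print(round(synergy_score(generalist, specialist), 4))
```

Under this sketch, a model that merely matches specialists everywhere scores zero synergy, which mirrors the framework's insistence that higher levels require surpassing, not just reproducing, specialist performance.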
Benchmarking Comprehensive Multimodal Models
The research team evaluated 172 large models, including more than 100 top multimodal large language models (MLLMs), using General-Bench. Results show that most models lack the necessary synergy to reach higher generalist levels. Notably, advanced models like GPT-4V and GPT-4o did not achieve Level 5, which requires enhancing language understanding using non-language inputs. Even the best-performing models only exhibited basic multimodal interactions without full synergy across tasks and modalities.
General-Bench covers 702 tasks across 145 skills and spans 29 disciplines, employing 58 evaluation metrics. This comprehensive scope sets a new standard for assessing multimodal AI capabilities.
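General-Bench's actual aggregation is defined in the paper; as a hedged sketch of one way a scoring system could keep a modality with many tasks from dominating an overall score (an assumption for illustration, not the benchmark's real method), scores can be macro-averaged per modality before averaging across modalities:

```python
# Hypothetical two-stage aggregation (not General-Bench's actual
# method): average within each modality first, then across modalities,
# so a modality with many tasks cannot dominate the overall score.
from collections import defaultdict

def balanced_average(task_scores: dict) -> float:
    """task_scores maps (modality, task) -> score in [0, 1]."""
    per_modality = defaultdict(list)
    for (modality, _task), score in task_scores.items():
        per_modality[modality].append(score)
    modality_means = [sum(v) / len(v) for v in per_modality.values()]
    return sum(modality_means) / len(modality_means)

# Illustrative scores: three image tasks vs. one audio task.
scores = {
    ("image", "vqa"): 0.9, ("image", "caption"): 0.7,
    ("image", "detect"): 0.8, ("audio", "asr"): 0.4,
}
print(round(balanced_average(scores), 2))  # image mean 0.8, audio 0.4 -> 0.6
```

A flat average over all four tasks would yield 0.7, pulled up by the image-heavy task list; the two-stage average of 0.6 weights each modality equally, which is the kind of balance the General-Level scoring aims for.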
Advancing the Future of Multimodal AI
This research highlights the gap between current multimodal AI systems and the envisioned generalist models capable of true integration. By introducing General-Level and General-Bench, the researchers provide valuable tools that prioritize holistic integration over task specialization. Their work lays a foundation for developing AI systems with enhanced flexibility, real-world applicability, and deeper cross-modal understanding.
For more information, see the Paper and Project Page.