
Can GPT-4o Truly See? Benchmarking Multimodal Models on Visual Understanding

A recent EPFL study benchmarks multimodal foundation models like GPT-4o on core vision tasks, revealing strengths in semantic understanding but highlighting gaps compared to specialized vision models.

Progress and Challenges of Multimodal Foundation Models

Multimodal foundation models (MFMs) such as GPT-4o, Gemini, and Claude have advanced rapidly, with their capabilities most visibly showcased through public demos. While their language understanding is well documented, their ability to interpret visual information remains less clear. Current benchmarks largely focus on tasks with text outputs, such as Visual Question Answering (VQA) or classification, which often reward language skills more than genuine visual comprehension. Important visual tasks like 3D perception, segmentation, and grouping are often neglected in these evaluations.

Evaluating Visual Skills Beyond Text-Based Tasks

MFMs have shown strong results in combined visual-language tasks, including captioning and VQA. However, their performance on tasks requiring detailed visual analysis is uncertain. Since these models typically output text and are accessed through APIs, comparing them fairly to vision-specialized models is challenging. Some attempts convert visual annotations to text to fit MFMs’ input-output format, but this restricts evaluation to language-based outputs. Prompting strategies have been developed to break down complex visual tasks into smaller, manageable language subtasks, though reproducibility is sometimes an issue.
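To make that constraint concrete, here is a small sketch of what converting a dense annotation into text can look like: a ground-truth bounding box is rendered as a string the model is asked to reproduce, and the model's free-form reply is parsed back into coordinates. The prompt format, the parser, and the helper names are illustrative assumptions, not the exact protocol of any specific benchmark.

```python
# Hypothetical sketch: serializing a COCO-style box annotation as text so a
# text-only MFM's reply can be scored against it. The prompt format and the
# parsing rule are illustrative assumptions, not the paper's exact protocol.
import re

def box_to_text(label, box):
    """Render a ground-truth box as the textual form the model is asked to produce."""
    x, y, w, h = box
    return f"{label}: [{x:.0f}, {y:.0f}, {w:.0f}, {h:.0f}]"

def parse_model_answer(answer):
    """Pull a 'label: [x, y, w, h]' pattern back out of free-form model text."""
    m = re.search(r"(\w+):\s*\[([\d.\s,]+)\]", answer)
    if m is None:
        return None  # the model answered in prose only; this counts as a miss
    label, coords = m.group(1), [float(v) for v in m.group(2).split(",")]
    return label, coords

print(box_to_text("dog", (34, 120, 200, 150)))
print(parse_model_answer("There is a dog: [30, 118, 205, 155] near the left edge."))
```

Comparing the parsed box with the ground truth (for example via IoU) then yields a score, but everything still flows through text, which is why such evaluations tend to favour language-friendly tasks.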

EPFL’s Benchmarking Study of MFMs on Core Vision Tasks

Researchers at EPFL tested prominent MFMs—GPT-4o, Gemini 2.0 Flash, Claude 3.5 Sonnet—on fundamental computer vision tasks such as segmentation, object detection, and depth prediction. They used datasets including COCO and ImageNet. To accommodate MFMs’ text-based output, they created a prompt-chaining framework that transforms visual tasks into text-compatible subtasks. Results revealed that while MFMs are effective generalists, they fall short of specialized vision models, especially on geometric tasks. GPT-4o led performance, ranking best in 4 out of 6 tasks. The team plans to open-source the evaluation toolkit.
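As a rough illustration of the prompt-chaining idea, the sketch below wires subtasks into a chain where each step sends a prompt, parses the text reply, and passes the result forward as context for the next step. The `ChainStep` and `run_chain` names, the two-step decomposition, and the stub `ask` callable are assumptions for illustration; the released toolkit may be structured differently.

```python
# Minimal prompt-chaining sketch, assuming a generic ask(image, prompt) -> str
# helper that wraps whichever MFM API is under evaluation. All names here are
# illustrative; they are not taken from the authors' toolkit.
from dataclasses import dataclass
from typing import Callable

@dataclass
class ChainStep:
    name: str
    build_prompt: Callable[[dict], str]  # may use results from earlier steps
    parse: Callable[[str], object]       # turns the text reply into structured data

def run_chain(image, steps, ask):
    """Run each subtask in order, carrying parsed results forward as context."""
    context = {}
    for step in steps:
        reply = ask(image, step.build_prompt(context))
        context[step.name] = step.parse(reply)
    return context

# Example decomposition: list the objects first, then ask a coarse location question.
steps = [
    ChainStep("objects",
              lambda ctx: "List the objects visible in this image, one per line.",
              lambda text: [line.strip() for line in text.splitlines() if line.strip()]),
    ChainStep("side",
              lambda ctx: f"Is the {ctx['objects'][0]} in the left or right half of the image?",
              lambda text: text.strip().lower()),
]

# Stub ask() shows the call pattern; a real run would query the MFM's API here.
fake_ask = lambda img, prompt: "dog\ncat" if prompt.startswith("List") else "left"
print(run_chain("image.jpg", steps, fake_ask))  # {'objects': ['dog', 'cat'], 'side': 'left'}
```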

Prompt Chaining Strategy for Visual Task Decomposition

The study introduced a method to decompose complex vision tasks into simpler language-friendly subtasks through prompt chaining. For example, instead of direct bounding box prediction, the model first identifies objects, then locates them by recursively cropping images. For segmentation and grouping, images are split into superpixels for easier labeling and comparison. Depth and surface normals are estimated by pairwise ranking of superpixel regions. This modular approach leverages MFMs’ strengths in classification and similarity assessment, with calibration techniques to ensure fair comparisons. Performance improves with more detailed prompting.
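The recursive-cropping step can be pictured as a coarse-to-fine search driven only by yes/no answers. The sketch below assumes a `contains(crop, label)` oracle answered by the MFM; the halving scheme, stopping size, and synthetic oracle are simplifying choices for illustration, not the paper's exact procedure.

```python
# Illustrative sketch of localisation by recursive cropping, assuming a yes/no
# contains(crop, label) oracle answered by the MFM. Halving scheme and stopping
# size are simplifying assumptions, not the paper's exact procedure.
def locate(box, label, contains, min_size=32):
    """Shrink the search window by asking which half still shows the object."""
    x, y, w, h = box
    while w > min_size or h > min_size:
        if w >= h:  # split the wider side
            halves = [(x, y, w // 2, h), (x + w // 2, y, w - w // 2, h)]
        else:
            halves = [(x, y, w, h // 2), (x, y + h // 2, w, h - h // 2)]
        keep = [hb for hb in halves if contains(hb, label)]
        if len(keep) != 1:  # object straddles the split (or the model is unsure): stop
            break
        x, y, w, h = keep[0]
    return (x, y, w, h)

# Synthetic oracle for demonstration: the object's centre sits at pixel (70, 120).
centre = (70, 120)
def oracle(crop, label):
    cx, cy, cw, ch = crop
    return cx <= centre[0] < cx + cw and cy <= centre[1] < cy + ch

print(locate((0, 0, 640, 480), "dog", oracle))  # a small crop containing (70, 120)
```

Depth-style questions can be handled in the same spirit at the superpixel level, for instance by asking which of two regions is closer to the camera and aggregating the pairwise answers into a ranking.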

Performance Comparison and Limitations

The evaluation covered multiple MFMs and tasks including image classification, object detection, and segmentation, using datasets such as ImageNet, COCO, and Hypersim. GPT-4o achieved 77.2% accuracy on ImageNet and 60.62 AP50 for object detection, while specialist models such as ViT-G (classification) and Co-DETR (detection) score above 90% on their respective metrics. For semantic segmentation, GPT-4o reached 44.89 mIoU compared to OneFormer's 65.52. MFMs handled distribution shifts reasonably well but lagged in precise visual reasoning. The study also introduced oracle baselines to estimate upper-bound performance under the prompt-chaining setup.
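For readers unfamiliar with the segmentation metric quoted above, the snippet below computes a simplified, per-image version of mean IoU on integer label maps. Benchmark mIoU is typically accumulated over a whole dataset before averaging, so the toy numbers here are only meant to show what the metric measures.

```python
# Simplified, per-image mean-IoU sketch on integer label maps (NumPy). Toy arrays
# only; benchmark mIoU is accumulated over a whole dataset before averaging.
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Average intersection-over-union across classes present in either map."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return float(np.mean(ious))

gt   = np.array([[0, 0, 1],
                 [0, 1, 1]])
pred = np.array([[0, 1, 1],
                 [0, 1, 1]])
print(round(mean_iou(pred, gt, num_classes=2), 3))  # 0.708 on this toy pair
```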

Key Takeaways and Future Outlook

The benchmarking framework provides a unified way to assess MFMs' visual understanding by translating vision tasks into prompt-based formats. MFMs perform better on semantic tasks than on geometric ones, with GPT-4o currently the strongest of those tested. Despite trailing task-specific vision models, MFMs show promising progress, and newer reasoning-focused models such as o3 show gains on 3D tasks. Challenges remain, including high inference costs and sensitivity to prompt design. This work lays a foundation for future improvements in multimodal vision capabilities.

For more information, check out the [Paper], [GitHub Page], and [Project] from the original researchers.
