Mirage: Enabling Visual Reasoning in Vision-Language Models Without Image Generation
Mirage introduces a new method for vision-language models to integrate visual reasoning without generating images, significantly enhancing their ability to solve spatial and multimodal tasks.
Limitations of Current Vision-Language Models (VLMs)
Vision-Language Models excel at understanding both text and images, but their reasoning often relies solely on textual information. This limits their effectiveness in tasks requiring visual thinking, such as solving spatial puzzles. Humans naturally use mental imagery to visualize solutions rather than describing every detail verbally, a capacity VLMs currently lack. Although some recent models generate both text and images, training for image generation can weaken reasoning abilities and does not inherently support step-by-step visual reasoning.
Existing Approaches to Multimodal Reasoning
Chain-of-Thought (CoT) prompting encourages models to reason step-by-step by providing intermediate explanations. Extensions of CoT to multimodal tasks embed visual information directly into the reasoning process. For example, ICoT integrates image regions within text sequences, while Visual CoT uses visual annotations to enhance spatial understanding. However, these models often require extensive supervision and are computationally expensive. Alternative research explores embedding reasoning internally via hidden states using special tokens or latent representations, bypassing explicit reasoning steps.
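For reference, a minimal text-only chain-of-thought prompt might look like the sketch below; the wording and the example question are purely illustrative and not drawn from any of the cited systems.

```python
# Minimal illustration of text-only chain-of-thought prompting, the baseline
# that the multimodal extensions build on. The prompt text is illustrative.
question = (
    "A 3x3 maze has its exit in the bottom-right corner. You start in the "
    "top-left corner and the center cell is blocked. How many moves do you need?"
)

direct_prompt = f"{question}\nAnswer:"

cot_prompt = (
    f"{question}\n"
    "Let's think step by step, describing each intermediate move, "
    "then state the final answer."
)

# Multimodal CoT variants (e.g. ICoT, Visual CoT) additionally splice image
# regions or visual annotations between these textual reasoning steps.
print(cot_prompt)
```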
Introducing Mirage: Mental Imagery Inspired Framework
Researchers from the University of Massachusetts Amherst and MIT propose Mirage, a novel framework inspired by human mental imagery. Rather than generating full images, Mirage enables VLMs to interleave compact visual cues derived from hidden states directly into text outputs. This approach mimics how humans form internal, task-relevant visuals while thinking.
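The decoding loop below is a rough, hypothetical sketch of this idea: at each step the model either emits an ordinary text token or feeds a compact hidden-state "latent token" back in as the next input embedding, without ever decoding it into pixels. The toy `LatentReasoner` module, the `LATENT_ID` marker, and all hyperparameters are assumptions for illustration, not Mirage's actual architecture.

```python
# Sketch of interleaved latent-token decoding with a toy decoder stack.
import torch
import torch.nn as nn

VOCAB, D_MODEL, LATENT_ID = 1000, 64, 999  # LATENT_ID means "emit a latent visual token"

class LatentReasoner(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(D_MODEL, VOCAB)
        self.latent_proj = nn.Linear(D_MODEL, D_MODEL)  # hidden state -> next input embedding

    def step(self, input_embeds):
        h = self.backbone(input_embeds)          # (B, T, D)
        return h[:, -1], self.lm_head(h[:, -1])  # last hidden state and text logits

@torch.no_grad()
def generate(model, prompt_ids, max_steps=20):
    embeds = model.embed(prompt_ids)             # start from the text prompt embeddings
    out_tokens = []
    for _ in range(max_steps):
        hidden, logits = model.step(embeds)
        next_id = logits.argmax(-1)
        if next_id.item() == LATENT_ID:
            # "Visual thought": feed the compact hidden state straight back in,
            # never rendering it as an image.
            next_embed = model.latent_proj(hidden).unsqueeze(1)
            out_tokens.append("<latent>")
        else:
            next_embed = model.embed(next_id).unsqueeze(1)
            out_tokens.append(int(next_id))
        embeds = torch.cat([embeds, next_embed], dim=1)
    return out_tokens

print(generate(LatentReasoner(), torch.tensor([[1, 2, 3]])))
```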
Mirage’s training involves two supervised stages: first, the compressed visual features (latent tokens) are grounded through joint supervision on both text and visual data; second, that constraint is relaxed so the model generates latent tokens autonomously to guide its reasoning. A final reinforcement learning stage refines performance by rewarding accuracy and well-structured reasoning.
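A minimal sketch of how the two supervised stages could be expressed as losses is shown below; the cosine-alignment target, the loss weight, and the helper names are assumptions rather than the paper's exact recipe.

```python
# Hedged sketch of a two-stage supervised objective for latent-token grounding.
import torch
import torch.nn.functional as F

def stage1_loss(text_logits, text_targets, latent_preds, latent_targets, alpha=1.0):
    """Stage 1: ground latent tokens with joint text + visual supervision.

    latent_targets stand in for compressed features of helper images (e.g.
    pooled encoder embeddings); latent_preds are the hidden states emitted
    at latent-token positions.
    """
    ce = F.cross_entropy(text_logits.flatten(0, 1), text_targets.flatten())
    align = 1.0 - F.cosine_similarity(latent_preds, latent_targets, dim=-1).mean()
    return ce + alpha * align

def stage2_loss(text_logits, text_targets):
    """Stage 2: drop the latent supervision so the model produces latent
    tokens autonomously; only the text output is supervised."""
    return F.cross_entropy(text_logits.flatten(0, 1), text_targets.flatten())

# Toy shapes: batch of 2, sequence of 5, vocab of 100, latent dim 64.
logits = torch.randn(2, 5, 100)
targets = torch.randint(0, 100, (2, 5))
lat_pred, lat_tgt = torch.randn(2, 3, 64), torch.randn(2, 3, 64)
print(stage1_loss(logits, targets, lat_pred, lat_tgt), stage2_loss(logits, targets))
```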
Performance on Spatial Reasoning Tasks
The model was evaluated on four spatial reasoning benchmarks, including visual puzzles and geometry problems, using a dataset of 1,000 training samples. To support reasoning, synthetic helper images and intermediate thought steps were generated to mimic human visual cues such as sketches. Mirage consistently outperformed both text-only and multimodal baselines, even on planning-heavy tasks such as maze solving. A smaller model variant also performed strongly, underscoring the method's robustness. Ablation studies showed that first grounding the latent visual tokens and then relaxing that supervision is essential.
Implications and Future Directions
Mirage demonstrates that interleaving visual and textual reasoning without generating actual images improves understanding and accuracy in vision-language tasks. This lightweight approach enables VLMs to think more like humans by leveraging internal visual cues. Challenges remain in scaling the approach to broader tasks and enhancing synthetic training data quality.
For more details, refer to the Paper and the GitHub Page. All credit goes to the original researchers behind this work.