Mirage: Enabling Visual Reasoning in Vision-Language Models Without Image Generation
Mirage introduces a new method for vision-language models to integrate visual reasoning without generating images, significantly enhancing their ability to solve spatial and multimodal tasks.
Limitations of Current Vision-Language Models (VLMs)
Vision-Language Models excel at understanding both text and images, but their reasoning often relies solely on textual information. This limits their effectiveness in tasks requiring visual thinking, such as solving spatial puzzles. Humans naturally use mental imagery to visualize solutions rather than describing every detail verbally, a capacity VLMs currently lack. Although some recent models generate both text and images, training for image generation can weaken reasoning abilities and does not inherently support step-by-step visual reasoning.
Existing Approaches to Multimodal Reasoning
Chain-of-Thought (CoT) prompting encourages models to reason step-by-step by providing intermediate explanations. Extensions of CoT to multimodal tasks embed visual information directly into the reasoning process. For example, ICoT integrates image regions within text sequences, while Visual CoT uses visual annotations to enhance spatial understanding. However, these models often require extensive supervision and are computationally expensive. Alternative research explores embedding reasoning internally via hidden states using special tokens or latent representations, bypassing explicit reasoning steps.
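For reference, a minimal text-only chain-of-thought prompt might look like the sketch below; the wording and the example question are purely illustrative and not drawn from any of the cited systems.

```python
# Minimal illustration of text-only chain-of-thought prompting, the baseline
# that the multimodal extensions build on. The prompt text is illustrative.
question = (
    "A 3x3 maze has its exit in the bottom-right corner. You start in the "
    "top-left corner and the center cell is blocked. How many moves do you need?"
)

direct_prompt = f"{question}\nAnswer:"

cot_prompt = (
    f"{question}\n"
    "Let's think step by step, describing each intermediate move, "
    "then state the final answer."
)

# Multimodal CoT variants (e.g. ICoT, Visual CoT) additionally splice image
# regions or visual annotations between these textual reasoning steps.
print(cot_prompt)
```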
Introducing Mirage: Mental Imagery Inspired Framework
Researchers from the University of Massachusetts Amherst and MIT propose Mirage, a novel framework inspired by human mental imagery. Rather than generating full images, Mirage enables VLMs to interleave compact visual cues derived from hidden states directly into text outputs. This approach mimics how humans form internal, task-relevant visuals while thinking.
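The decoding loop below is a rough, hypothetical sketch of this idea: at each step the model either emits an ordinary text token or feeds a compact hidden-state "latent token" back in as the next input embedding, without ever decoding it into pixels. The toy `LatentReasoner` module, the `LATENT_ID` marker, and all hyperparameters are assumptions for illustration, not Mirage's actual architecture.

```python
# Sketch of interleaved latent-token decoding with a toy decoder stack.
import torch
import torch.nn as nn

VOCAB, D_MODEL, LATENT_ID = 1000, 64, 999  # LATENT_ID means "emit a latent visual token"

class LatentReasoner(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(D_MODEL, VOCAB)
        self.latent_proj = nn.Linear(D_MODEL, D_MODEL)  # hidden state -> next input embedding

    def step(self, input_embeds):
        h = self.backbone(input_embeds)          # (B, T, D)
        return h[:, -1], self.lm_head(h[:, -1])  # last hidden state and text logits

@torch.no_grad()
def generate(model, prompt_ids, max_steps=20):
    embeds = model.embed(prompt_ids)             # start from the text prompt embeddings
    out_tokens = []
    for _ in range(max_steps):
        hidden, logits = model.step(embeds)
        next_id = logits.argmax(-1)
        if next_id.item() == LATENT_ID:
            # "Visual thought": feed the compact hidden state straight back in,
            # never rendering it as an image.
            next_embed = model.latent_proj(hidden).unsqueeze(1)
            out_tokens.append("<latent>")
        else:
            next_embed = model.embed(next_id).unsqueeze(1)
            out_tokens.append(int(next_id))
        embeds = torch.cat([embeds, next_embed], dim=1)
    return out_tokens

print(generate(LatentReasoner(), torch.tensor([[1, 2, 3]])))
```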
Mirage’s training involves two supervised stages: first, the compressed visual features (latent tokens) are grounded through joint supervision on both text and visual data; second, that constraint is relaxed so the model generates latent tokens autonomously to guide its reasoning. A final reinforcement learning stage refines performance by rewarding accuracy and well-structured reasoning.
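A minimal sketch of how the two supervised stages could be expressed as losses is shown below; the cosine-alignment target, the loss weight, and the helper names are assumptions rather than the paper's exact recipe.

```python
# Hedged sketch of a two-stage supervised objective for latent-token grounding.
import torch
import torch.nn.functional as F

def stage1_loss(text_logits, text_targets, latent_preds, latent_targets, alpha=1.0):
    """Stage 1: ground latent tokens with joint text + visual supervision.

    latent_targets stand in for compressed features of helper images (e.g.
    pooled encoder embeddings); latent_preds are the hidden states emitted
    at latent-token positions.
    """
    ce = F.cross_entropy(text_logits.flatten(0, 1), text_targets.flatten())
    align = 1.0 - F.cosine_similarity(latent_preds, latent_targets, dim=-1).mean()
    return ce + alpha * align

def stage2_loss(text_logits, text_targets):
    """Stage 2: drop the latent supervision so the model produces latent
    tokens autonomously; only the text output is supervised."""
    return F.cross_entropy(text_logits.flatten(0, 1), text_targets.flatten())

# Toy shapes: batch of 2, sequence of 5, vocab of 100, latent dim 64.
logits = torch.randn(2, 5, 100)
targets = torch.randint(0, 100, (2, 5))
lat_pred, lat_tgt = torch.randn(2, 3, 64), torch.randn(2, 3, 64)
print(stage1_loss(logits, targets, lat_pred, lat_tgt), stage2_loss(logits, targets))
```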
Performance on Spatial Reasoning Tasks
The model was evaluated on four spatial reasoning benchmarks, including visual puzzles and geometry problems, using a dataset of 1,000 training samples. To support reasoning, synthetic helper images and intermediate thought steps were generated to mimic human visual cues such as sketches. Mirage consistently outperformed both text-only and multimodal baselines, even on planning-heavy tasks such as maze solving. A smaller model variant also performed strongly, underscoring the method's robustness. Ablation studies showed that first grounding the latent visual tokens and then relaxing that supervision is essential.
Implications and Future Directions
Mirage demonstrates that interleaving visual and textual reasoning without generating actual images improves understanding and accuracy in vision-language tasks. This lightweight approach enables VLMs to think more like humans by leveraging internal visual cues. Challenges remain in scaling the approach to broader tasks and enhancing synthetic training data quality.
For more details, refer to the Paper and the GitHub Page. All credit goes to the original researchers behind this work.