
MiMo-VL-7B: Advancing Visual and Multimodal AI with a State-of-the-Art Vision-Language Model

MiMo-VL-7B is a powerful vision-language model developed by Xiaomi researchers, offering state-of-the-art performance in visual understanding and multimodal reasoning through advanced training techniques.

Introducing MiMo-VL-7B: A Compact Yet Powerful Vision-Language Model

Vision-language models (VLMs) play a crucial role in multimodal AI applications, enabling machines to interpret visual contexts, perform reasoning across different modalities, and interact effectively with digital and physical environments. Xiaomi researchers have developed MiMo-VL-7B, a compact VLM designed to deliver powerful visual and multimodal reasoning capabilities. It comprises three main components: a native-resolution Vision Transformer (ViT) encoder that retains detailed visual information, a Multi-Layer Perceptron (MLP) projector for efficient alignment between vision and language modalities, and the MiMo-7B language model optimized for complex reasoning.

Comprehensive Two-Stage Training Process

MiMo-VL-7B undergoes an extensive two-phase training process. The first phase is a four-stage pre-training routine that includes projector warmup, vision-language alignment, general multimodal pre-training, and long-context supervised fine-tuning. This phase leverages an enormous dataset of 2.4 trillion tokens, producing the MiMo-VL-7B-SFT variant. The second phase involves post-training using Mixed On-policy Reinforcement Learning (MORL), which integrates multiple reward signals related to perception accuracy, visual grounding, logical reasoning, and human preferences, culminating in the MiMo-VL-7B-RL model.
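
To make the staged recipe concrete, the sketch below writes it out as a simple configuration list. Which modules are trainable at each stage and the per-stage sequence lengths are illustrative assumptions for exposition, not the paper's exact settings.

```python
# Illustrative sketch of the four-stage pre-training schedule described above.
# The trainable-module split and sequence lengths are assumptions, not the
# actual MiMo-VL-7B recipe.
PRETRAINING_STAGES = [
    {"name": "projector_warmup",          "trainable": ["projector"],               "max_seq_len": 8_192},
    {"name": "vision_language_alignment", "trainable": ["vit", "projector"],        "max_seq_len": 8_192},
    {"name": "general_multimodal",        "trainable": ["vit", "projector", "llm"], "max_seq_len": 8_192},
    {"name": "long_context_sft",          "trainable": ["vit", "projector", "llm"], "max_seq_len": 32_768},
]

for stage in PRETRAINING_STAGES:
    print(f"{stage['name']}: train {stage['trainable']} at seq len {stage['max_seq_len']}")
```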

Architecture Details

The model architecture integrates three key components: (a) a Vision Transformer encoder (Qwen2.5-ViT) supporting native resolution inputs for images and videos, (b) a projector that maps visual features into a latent space aligned with the language model, and (c) the MiMo-7B language model backbone known for its strong reasoning abilities. The training data is diverse, including multimodal datasets, image captions, OCR data, grounding data, videos, GUI interactions, and reasoning examples, ensuring robust learning across various tasks.
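
As a rough illustration of this layout, the following PyTorch sketch wires a vision encoder, an MLP projector, and a language model into a single forward pass. All class names, dimensions, and the token-concatenation scheme are illustrative assumptions rather than the actual MiMo-VL-7B implementation.

```python
# Minimal sketch of the three-component VLM layout described above.
# Module names and shapes are assumptions, not the MiMo-VL-7B code.
import torch
import torch.nn as nn


class MLPProjector(nn.Module):
    """Maps vision-encoder features into the language model's embedding space."""

    def __init__(self, vision_dim: int, hidden_dim: int, lm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, lm_dim),
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        return self.net(visual_tokens)


class ToyVisionLanguageModel(nn.Module):
    """ViT encoder -> MLP projector -> language model, as in the description above."""

    def __init__(self, vision_encoder: nn.Module, projector: MLPProjector, language_model: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.projector = projector
        self.language_model = language_model

    def forward(self, pixel_values: torch.Tensor, text_embeddings: torch.Tensor) -> torch.Tensor:
        visual_tokens = self.vision_encoder(pixel_values)            # (B, N_img, vision_dim)
        visual_embeds = self.projector(visual_tokens)                # (B, N_img, lm_dim)
        inputs = torch.cat([visual_embeds, text_embeddings], dim=1)  # prepend image tokens to text
        return self.language_model(inputs)
```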

Enhancing Performance with MORL

The post-training phase employs MORL, combining Reinforcement Learning with Verifiable Rewards (RLVR) and Reinforcement Learning from Human Feedback (RLHF). RLVR uses rule-based rewards for continuous, verifiable improvement on reasoning and perception tasks, while RLHF ensures alignment with human preferences and reduces undesirable outputs. MORL optimizes both objectives simultaneously to boost the model's overall reasoning and alignment capabilities.
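
A minimal sketch of how such a mixed objective could be scored per rollout is shown below. The specific reward functions and the linear weighting are assumptions made for illustration; the paper's exact MORL formulation is not reproduced here.

```python
# Illustrative blend of a verifiable, rule-based reward (RLVR) with a learned
# human-preference reward (RLHF) for a single rollout. Weights and reward
# functions are assumptions, not the paper's formulation.
from typing import Callable


def mixed_reward(
    response: str,
    reference_answer: str,
    rule_based_check: Callable[[str, str], float],   # e.g. exact match, or IoU for grounding
    preference_model_score: Callable[[str], float],  # learned RLHF reward model
    rlvr_weight: float = 0.5,
) -> float:
    """Blend a verifiable reward with a human-preference reward."""
    r_verifiable = rule_based_check(response, reference_answer)  # assumed in [0, 1]
    r_preference = preference_model_score(response)              # assumed normalized to [0, 1]
    return rlvr_weight * r_verifiable + (1.0 - rlvr_weight) * r_preference


if __name__ == "__main__":
    exact_match = lambda resp, ref: float(resp.strip() == ref.strip())
    dummy_rm = lambda resp: min(len(resp) / 100.0, 1.0)  # placeholder preference score
    print(mixed_reward("42", "42", exact_match, dummy_rm))
```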

Superior Evaluation Results

Evaluations across 50 diverse tasks demonstrate MiMo-VL-7B’s leading performance among open-source models. The MiMo-VL-7B-SFT and MiMo-VL-7B-RL models achieve 64.6% and 66.7% respectively on the MMMU validation set, outperforming larger models such as Gemma 3 27B. On chart and document understanding tasks such as CharXiv-RQ, the RL model surpasses competitors by wide margins. In multimodal reasoning, both variants outperform open-source baselines, with the RL model further enhancing accuracy on MathVision. The model also excels at GUI understanding and grounding, rivaling specialized models on benchmarks like ScreenSpot-Pro and OSWorld-G.

Leading Open-Source Vision-Language Model

MiMo-VL-7B holds the highest Elo rating among evaluated open-source VLMs, outperforming models ranging from 7B to 72B parameters and approaching proprietary models such as Claude 3.7 Sonnet. MORL contributes a 22+ point Elo improvement over the SFT model, underscoring the effectiveness of the training strategy and the model’s strong general-purpose capabilities.

Contribution to the Research Community

The researchers have open-sourced their extensive evaluation suite to foster transparency and reproducibility in multimodal AI research. This work represents a significant advancement for open-source vision-language models and offers valuable insights into training methodologies that balance diverse capabilities and reasoning demands.

For further details, explore the Paper, GitHub Page, and the Model on Hugging Face.
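
For readers who want to try the released checkpoints, a loading sketch with the transformers library is shown below. The repository id, model class, and chat-template usage are assumptions; consult the official model card for the recommended loading code.

```python
# Sketch of loading the released checkpoint from Hugging Face with transformers.
# The repo id and exact classes are assumptions; check the model card.
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image

model_id = "XiaomiMiMo/MiMo-VL-7B-RL"  # assumed repository id; verify on Hugging Face

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

image = Image.open("example.png")
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```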
