
UniME: Advancing Multimodal Representations with a Two-Stage MLLM Framework

UniME introduces a two-stage framework that significantly improves multimodal representation learning by leveraging textual knowledge distillation and hard negative instruction tuning, outperforming existing models on multiple benchmarks.

Challenges in Current Multimodal Representation Learning

The CLIP framework has been pivotal in multimodal representation learning, especially for image-text retrieval. Despite its success, CLIP has notable limitations: a rigid 77-token cap on text input, a dual-encoder design that processes images and text separately, and compositional understanding that often reduces to bag-of-words matching. These constraints limit its ability to capture subtle, instruction-sensitive semantics. Meanwhile, Multimodal Large Language Models (MLLMs) such as LLaVA, Qwen2-VL, and CogVLM advance vision-language reasoning but are trained with an autoregressive next-token prediction objective, which is poorly suited to learning generalized, transferable embeddings.
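The 77-token constraint is easy to observe in practice. The minimal sketch below uses the Hugging Face transformers CLIPTokenizer (not part of UniME itself) to show that anything beyond the fixed context length is simply truncated before the text encoder ever sees it:

```python
from transformers import CLIPTokenizer

# Load the tokenizer shipped with a standard CLIP checkpoint.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
print(tokenizer.model_max_length)  # 77: CLIP's fixed text context length

# A caption that is clearly longer than 77 tokens.
long_caption = "a photo of " + "a very detailed scene with many objects " * 20
encoded = tokenizer(long_caption, truncation=True, return_tensors="pt")

# Everything past the 77th token is discarded.
print(encoded["input_ids"].shape)  # torch.Size([1, 77])
```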

Recent Innovations and Remaining Challenges

Recent research has focused on overcoming these drawbacks through new architectures and training techniques. For example, E5-V uses unimodal contrastive training on text pairs to align cross-modal features, while VLM2Vec introduces the MMEB benchmark along with a framework for turning advanced vision-language models into effective embedding generators. Other efforts, such as LLM2Vec and NV-Embed, modify the attention mechanisms of decoder-only LLMs to improve text-based representation learning. However, challenges remain in handling long sequences, strengthening cross-modal fusion, and effectively distinguishing hard negatives during contrastive learning.

UniME Framework: A Two-Stage Approach

Researchers from The University of Sydney, DeepGlint, Tongyi Lab at Alibaba, and Imperial College London propose UniME, a novel two-stage framework for enhancing multimodal representation learning with MLLMs.

  • Stage One: Textual Discriminative Knowledge Distillation. The student MLLM is trained on text-only prompts to distill discriminative knowledge from a strong LLM teacher into its language component, improving the quality of its textual embeddings.

  • Stage Two: Hard Negative Enhanced Instruction Tuning. This stage improves the model's discriminative power and instruction-following ability by filtering out false negatives and sampling multiple challenging negatives per instance. Task-specific prompts further boost performance across applications such as retrieval and visual question answering. (A simplified sketch of both training objectives follows this list.)
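The sketch below illustrates how the two objectives can be written in PyTorch. It is not the authors' implementation: the exact loss formulations, temperature, batching, and embedding extraction in UniME may differ, and the function names and tensor shapes here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_text_emb, teacher_text_emb, tau=0.05):
    """Stage-one sketch: align the student MLLM's text embeddings with a
    strong teacher (e.g. NV-Embed V2) by matching their in-batch similarity
    distributions via KL divergence. Illustrative formulation only."""
    s = F.normalize(student_text_emb, dim=-1)
    t = F.normalize(teacher_text_emb, dim=-1)
    student_logits = s @ s.T / tau   # student's batch-wise similarities
    teacher_logits = t @ t.T / tau   # teacher's batch-wise similarities
    return F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )

def hard_negative_infonce(query_emb, pos_emb, hard_neg_emb, tau=0.05):
    """Stage-two sketch: contrast each query against its positive and
    k sampled hard negatives.
    Shapes: query_emb (B, D), pos_emb (B, D), hard_neg_emb (B, K, D)."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    n = F.normalize(hard_neg_emb, dim=-1)
    pos_logit = (q * p).sum(-1, keepdim=True) / tau        # (B, 1)
    neg_logits = torch.einsum("bd,bkd->bk", q, n) / tau    # (B, K)
    logits = torch.cat([pos_logit, neg_logits], dim=1)     # (B, 1+K)
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)  # positive sits at index 0
```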

Evaluation and Results

UniME was trained using PyTorch with DeepSpeed on 8 NVIDIA A100 GPUs. The training included a textual knowledge distillation phase with 273,000 NLI dataset pairs and a hard negative instruction tuning phase on 662,000 multimodal pairs. NV-Embed V2 served as the teacher model. UniME was evaluated on 36 datasets from the MMEB benchmark, consistently outperforming baselines such as E5-V and VLM2Vec.

Hard negatives significantly improved the model's ability to distinguish subtle differences, especially in tasks involving long captions and compositional retrieval. Ablation studies confirmed the contribution of both training stages and of key hyperparameter choices.
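One plausible way to realize the false-negative filtering and hard negative sampling described above is sketched here. The thresholding rule, margin, and number of sampled negatives are assumptions for illustration and may not match UniME's actual procedure.

```python
import torch
import torch.nn.functional as F

def sample_hard_negatives(query_emb, cand_emb, pos_idx, k=8, margin=0.0):
    """Illustrative sketch: drop candidates that look like false negatives
    (similarity to the query at or above the true positive's, plus a margin),
    then keep the k hardest remaining candidates.
    Shapes: query_emb (D,), cand_emb (N, D); pos_idx indexes the positive."""
    q = F.normalize(query_emb, dim=-1)
    c = F.normalize(cand_emb, dim=-1)
    sims = c @ q                                   # (N,) cosine similarities
    pos_sim = sims[pos_idx]
    mask = torch.ones_like(sims, dtype=torch.bool)
    mask[pos_idx] = False                          # exclude the positive itself
    mask &= sims < pos_sim + margin                # filter likely false negatives
    neg_sims = sims.masked_fill(~mask, float("-inf"))
    k = min(k, int(mask.sum()))                    # guard against too few candidates
    return neg_sims.topk(k).indices                # hardest surviving negatives
```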

Impact of UniME

UniME's two-stage approach effectively addresses previous limitations by strengthening language embeddings through knowledge distillation and enhancing cross-modal alignment via hard negative instruction tuning. This results in improved discriminative and compositional abilities, surpassing existing models like CLIP, and offering scalable, fine-grained semantic alignment for multimodal applications.
