
MMLONGBENCH: Setting a New Standard for Long-Context Vision-Language Models

Researchers have developed MMLONGBENCH, a comprehensive benchmark to evaluate long-context vision-language models on diverse tasks and input lengths, uncovering key performance insights and challenges.

Progress in Long-Context Vision-Language Modeling

Recent developments in long-context (LC) modeling have expanded the capabilities of large language models (LLMs) and large vision-language models (LVLMs). Long-context vision-language models (LCVLMs) enable processing of hundreds of images alongside thousands of interleaved text tokens within a single forward pass, marking a significant advancement. However, evaluation benchmarks have not kept pace, leaving the true performance and robustness of these models underexplored.

Limitations of Current Benchmarks

Techniques such as longer pre-training, position extrapolation, and efficient architectures have extended LVLM context windows, and models such as Gemini-2.5 and Qwen2.5-VL combine these strategies with vision token compression to handle longer sequences. On the evaluation side, however, existing benchmarks suffer from several shortcomings: limited downstream task coverage, a narrow variety of image types, lack of control over input context length, and evaluation at only a single fixed length. As a result, assessment remains confined mainly to Needle-in-a-Haystack tasks and long-document VQA, which fail to capture LC capabilities across diverse vision-language scenarios.

Introducing MMLONGBENCH

A collaborative effort by researchers from HKUST, Tencent AI Seattle Lab, University of Edinburgh, Miniml.AI, and NVIDIA AI Technology Center has produced MMLONGBENCH, the first comprehensive benchmark dedicated to LCVLM evaluation. It contains 13,331 examples spanning five downstream task categories, including Visual Retrieval-Augmented Generation (RAG) and Many-Shot In-Context Learning (ICL), and covers both natural and synthetic image types.

All tasks are standardized across five input lengths ranging from 8K to 128K tokens, using a cross-modal tokenization scheme that combines vision patches and text tokens. This uniform design enables rigorous assessment of models’ long-context abilities.

Evaluation Methodology

For the Visual RAG tasks, long contexts are constructed by inserting the gold passages that contain the answer among numerous distracting passages retrieved from Wikipedia. ViQuAE uses gold passages from KILT, while InfoSeek draws on the lead sections of Wikipedia entity pages. Wikipedia content is segmented into 100-word passages, and distractors are added until the target input length is reached.
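
To make this padding procedure concrete, the minimal sketch below assembles one long-context example from gold passages and a pool of 100-word Wikipedia distractor chunks. The function and argument names (build_long_context, count_tokens, distractor_pool) are hypothetical and not taken from the benchmark's released code; count_tokens stands for any tokenizer-based length function.

```python
import random

def build_long_context(gold_passages, distractor_pool, target_tokens,
                       count_tokens, seed=0):
    """Pad gold passages with Wikipedia distractors up to a token budget.

    `count_tokens` is any callable returning the token length of a passage
    (e.g. a Llama-2 tokenizer wrapper). All names here are illustrative.
    """
    rng = random.Random(seed)
    passages = list(gold_passages)
    used = sum(count_tokens(p) for p in passages)

    # Add 100-word distractor chunks until the target length is reached.
    for distractor in rng.sample(distractor_pool, len(distractor_pool)):
        cost = count_tokens(distractor)
        if used + cost > target_tokens:
            break  # stop once the next chunk would exceed the budget
        passages.append(distractor)
        used += cost

    # Shuffle so the gold evidence ends up at a random position.
    rng.shuffle(passages)
    return passages

# Usage (hypothetical): a ~16K-token context for one ViQuAE-style question.
# context = build_long_context([gold], wiki_chunks, 16_000, count_tokens)
```

The point of the shuffle is that the evidence is buried at an unpredictable position, so the context length is controlled without making the gold passage easy to locate.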

Many-shot in-context learning tasks draw on four diverse image-classification datasets: Stanford Cars, Food101, SUN397, and iNat2021, fitting 500 in-context images within 128K-token contexts. Cross-modal token counting combines Llama2-tokenized text with visual tokens computed from 14×14 image patches followed by 2×2 pixel unshuffle compression, keeping the counts consistent with how modern LVLMs consume images.
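
The sketch below shows one way such a cross-modal count could be computed, assuming the per-image cost comes from 14×14 patches reduced by a 2×2 pixel unshuffle. The helper names and the rounding behavior are assumptions for illustration, not the paper's exact implementation.

```python
import math

PATCH = 14       # ViT-style 14x14 pixel patches
UNSHUFFLE = 2    # 2x2 pixel unshuffle merges four patches into one token

def visual_token_count(width, height):
    """Approximate visual-token cost of one image under the scheme above."""
    patches_w = math.ceil(width / PATCH)
    patches_h = math.ceil(height / PATCH)
    # Pixel unshuffle reduces the patch count by a factor of UNSHUFFLE**2.
    return math.ceil(patches_w * patches_h / UNSHUFFLE**2)

def cross_modal_length(text_token_counts, image_sizes):
    """Total context length: text tokens plus compressed visual tokens."""
    return sum(text_token_counts) + sum(
        visual_token_count(w, h) for w, h in image_sizes
    )

# A 448x448 image yields (32 * 32) / 4 = 256 visual tokens.
assert visual_token_count(448, 448) == 256
```

At that rate, 500 images of 448×448 account for 128,000 tokens, which matches the benchmark's largest input-length setting; the exact per-image cost naturally varies with image resolution.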

Benchmark Results

Testing 46 closed-source and open-source models reveals widespread challenges. Performance on single tasks does not reliably predict overall long-context capability. Closed-source models generally outperform open-source counterparts. For the longest input length (128K tokens), all models face difficulties; GPT-4o attains only a 62.9 average score. Gemini-2.5-Pro leads, surpassing open models by 20 points except on ICL tasks. Ovis2-34B and GPT-4o show comparable summarization scores, while Qwen2.5-VL-32B achieves the highest SubEM score on Visual RAG, outperforming Gemini-2.0-Flash. Notably, some models generalize beyond their training context lengths, such as Qwen2-VL-72B scoring 51.9 at 128K despite being trained on 32K tokens.

Impact and Future Directions

MMLONGBENCH establishes a rigorous evaluation framework covering diverse tasks with unified token counting and controlled context lengths. It highlights the challenges frontier models face, particularly in OCR accuracy and cross-modal retrieval, and underscores the limitations of relying on single-task performance for LC capability assessment.

This benchmark aims to stimulate advancements in efficient vision-language token encoding, robust position extrapolation, and enhanced multi-modal retrieval and reasoning, guiding future research toward more capable long-context vision-language models.

For more details, see the paper and the GitHub page.
