NVIDIA's Eagle 2.5: A Compact Vision-Language Model That Matches GPT-4o on Long-Context Video Tasks
NVIDIA unveiled Eagle 2.5, a compact 8B parameter vision-language model that achieves state-of-the-art performance on long-context video tasks, rivaling much larger models like GPT-4o through innovative training and data strategies.
Breaking Barriers in Long-Context Multimodal Understanding
Vision-language models have advanced rapidly but often struggle to process extended multimodal inputs such as high-resolution images or long videos. Traditional models suffer from loss of semantic detail, inefficient memory use, and degraded performance when scaling to long-context data.
Introducing Eagle 2.5: Efficient and Generalist
NVIDIA’s Eagle 2.5 is designed specifically to tackle these challenges. It is a family of vision-language models optimized for long-context learning, handling both images and videos effectively. Despite having only 8 billion parameters, Eagle 2.5 achieves performance comparable to much larger models, such as Qwen2.5-VL-72B and InternVL2.5-78B, particularly on video tasks with 512-frame inputs.
Innovative Training Strategies
Eagle 2.5 employs two key training strategies that enhance its long-context capabilities:
- Information-First Sampling: Prioritizes the most informative visual and semantic content using techniques such as Image Area Preservation (IAP), which keeps over 60% of the original image area with minimal distortion, and Automatic Degradation Sampling (ADS), which dynamically balances visual and textual inputs (a minimal sketch of the area-preservation idea follows this list).
- Progressive Post-Training: Gradually expands the model's context window from 32K to 128K tokens, enabling stable performance across varying input lengths without overfitting.
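NVIDIA has not released reference code for these sampling strategies, so the snippet below is only a minimal sketch of the Image Area Preservation idea under stated assumptions: pick a tile grid for a high-resolution image that keeps at least 60% of the original pixel count while minimizing aspect-ratio distortion. The function name, scoring rule, and defaults are illustrative, not from the Eagle 2.5 codebase.

```python
# Hypothetical sketch of IAP-style tile-grid selection (not NVIDIA's implementation).
import math

def choose_tile_grid(img_w, img_h, tile=448, max_tiles=12, min_area_kept=0.60):
    """Pick a (cols, rows) tiling for a high-resolution image.

    The image would be resized to fill a (cols*tile) x (rows*tile) canvas.
    Grids that keep at least `min_area_kept` of the original pixel count are
    preferred; among those, aspect-ratio distortion is minimized.
    """
    orig_area = img_w * img_h
    orig_ar = img_w / img_h
    best, best_key = None, None
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            canvas_w, canvas_h = cols * tile, rows * tile
            # Fraction of the original pixels the canvas can represent.
            area_kept = min(1.0, (canvas_w * canvas_h) / orig_area)
            # Log-scale aspect-ratio mismatch; 0 means no stretching needed.
            distortion = abs(math.log((canvas_w / canvas_h) / orig_ar))
            feasible = area_kept >= min_area_kept
            # Rank: feasible grids first, then low distortion, else high area kept.
            key = (not feasible, distortion if feasible else -area_kept)
            if best_key is None or key < best_key:
                best, best_key = (cols, rows), key
    return best

# Example: a 1920x1080 frame maps to a wide 4x2 grid rather than a lossy square.
print(choose_tile_grid(1920, 1080))
```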
The architecture leverages SigLIP for vision encoding and MLP projection layers, maintaining flexibility by avoiding task-specific compression modules.
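As a rough illustration of that wiring, the sketch below shows a vision-encoder-to-LLM MLP projector in plain PyTorch. The dimensions (1152 for SigLIP-style features, 4096 for the language model) and module names are placeholders, not the actual Eagle 2.5 configuration.

```python
# Minimal sketch of the vision-encoder -> MLP-projector wiring described above.
# Dimensions and names are assumptions; Eagle 2.5 uses SigLIP as its vision encoder.
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Two-layer MLP mapping vision features into the LLM embedding space."""
    def __init__(self, vision_dim: int = 1152, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim) from the vision encoder
        return self.proj(vision_feats)  # (batch, num_patches, llm_dim)

# Each sampled frame/tile yields a patch sequence; the projected visual tokens are
# concatenated with text tokens before entering the language model.
frames = torch.randn(1, 729, 1152)            # stand-in for SigLIP patch features
visual_tokens = VisionToLLMProjector()(frames)
print(visual_tokens.shape)                    # torch.Size([1, 729, 4096])
```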
Eagle-Video-110K: A Rich Dataset for Long-Form Video Understanding
A critical component is the Eagle-Video-110K dataset, combining open-source and proprietary data with dual annotation schemes:
- Top-Down: Story-level segmentation with human-annotated chapters plus GPT-4-generated dense captions and QA pairs.
- Bottom-Up: Short-clip QA pairs generated with GPT-4o, anchored to time and textual context.
The dataset draws on sources such as InternVid and VidChapters and applies cosine-similarity filtering to balance diversity with narrative coherence.
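The article does not detail the filtering step, but a cosine-similarity diversity filter typically looks like the minimal sketch below; the greedy strategy, embedding source, and 0.9 threshold are assumptions rather than the paper's actual values.

```python
# Hypothetical sketch of cosine-similarity diversity filtering for clip/caption data.
import numpy as np

def diversity_filter(embeddings: np.ndarray, threshold: float = 0.9) -> list[int]:
    """Greedily keep samples whose embedding is not too similar to any kept sample.

    embeddings: (n, d) array of text or video-clip embeddings.
    Returns indices of the retained samples.
    """
    # L2-normalize so a dot product equals cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i, vec in enumerate(normed):
        if not kept or np.max(normed[kept] @ vec) < threshold:
            kept.append(i)
    return kept

# Example: three near-duplicate embeddings collapse to a single representative.
rng = np.random.default_rng(0)
base = rng.normal(size=(1, 64))
emb = np.vstack([base, base + 0.01 * rng.normal(size=(2, 64)), rng.normal(size=(5, 64))])
print(diversity_filter(emb))
```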
Impressive Performance Across Benchmarks
Eagle 2.5-8B excels in both video and image benchmarks, scoring 74.8 on MVBench, 77.6 on MLVU, 66.4 on LongVideoBench, and strong results on DocVQA, ChartQA, and InfoVQA. Ablation studies highlight the importance of its sampling methods and progressive training, with Eagle-Video-110K significantly boosting high-frame-count performance.
A New Direction for Vision-Language Models
Eagle 2.5 proves that sophisticated training and data strategies can match or exceed large-scale models without massive parameter counts. This makes it a promising foundation for AI systems that require deep contextual understanding in real-world multimedia applications.
Check out the Paper, GitHub repository, and Project Page for more information.