FineVision: Hugging Face Releases 24M-Sample Open Dataset to Supercharge Vision-Language Models

FineVision at a glance

Hugging Face has announced FineVision, a fully open multimodal dataset for training Vision-Language Models (VLMs). It aggregates more than 200 sources into a unified, rigorously filtered format at substantial scale: 17.3 million images, 24.3 million samples, 88.9 million question-answer turns, and about 9.5 billion answer tokens. FineVision aims to be both large and low-leakage, with only about 1% overlap with common benchmark test sets.
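As a quick sanity check on these headline figures, the per-sample averages can be derived directly from them. The sketch below uses only the numbers quoted above; the variable names are illustrative, and the results are rough because the token count is itself approximate.

```python
# Headline figures as quoted in the article (the answer-token count is approximate).
IMAGES = 17.3e6
SAMPLES = 24.3e6
QA_TURNS = 88.9e6
ANSWER_TOKENS = 9.5e9

# Average number of question-answer turns attached to one sample.
turns_per_sample = QA_TURNS / SAMPLES

# Average answer length, in tokens, per question-answer turn.
tokens_per_turn = ANSWER_TOKENS / QA_TURNS

# Average number of samples per image.
samples_per_image = SAMPLES / IMAGES

print(f"QA turns per sample:    {turns_per_sample:.2f}")
print(f"Answer tokens per turn: {tokens_per_turn:.1f}")
print(f"Samples per image:      {samples_per_image:.2f}")
```

On these figures, a sample carries roughly 3.7 question-answer turns, and an answer averages a little over 100 tokens.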

Scale, coverage and new skill domains

FineVision spans 5 TB of curated data across nine categories, including General VQA, OCR QA, Chart & Table reasoning, Science, Captioning, Grounding & Counting, and GUI navigation. Several of these target emerging skills (GUI navigation, pointing, and counting), broadening the set of abilities VLMs can learn beyond classic captioning and VQA.

Key dataset statistics:

How FineVision was created

The team used a three-stage curation pipeline: collection and augmentation, cleaning, and quality rating.

Every QA pair was scored by Qwen3-32B and Qwen2.5-VL-32B-Instruct across four axes:

These ratings enable the construction of selective training mixtures. However, ablation studies reported by the authors indicate that keeping the full dataset, including lower-rated samples, tends to produce the best downstream performance.
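A selective mixture of this kind could be built with a simple threshold filter. The following is a minimal sketch, not the authors' code: the `ratings` field, the `axis_1` to `axis_4` placeholder names, and the integer score scale are all assumptions made for illustration.

```python
from typing import Any, Dict, Iterable, List

Sample = Dict[str, Any]


def select_by_rating(samples: Iterable[Sample], min_score: int) -> List[Sample]:
    """Keep only samples whose rating on every axis is at least `min_score`."""
    kept = []
    for sample in samples:
        # Hypothetical layout: one integer score per rating axis.
        ratings = sample["ratings"]
        if all(score >= min_score for score in ratings.values()):
            kept.append(sample)
    return kept


# Toy records with placeholder axis names (not the dataset's real schema).
toy_samples = [
    {"id": "a", "ratings": {"axis_1": 5, "axis_2": 4, "axis_3": 5, "axis_4": 4}},
    {"id": "b", "ratings": {"axis_1": 2, "axis_2": 5, "axis_3": 4, "axis_4": 5}},
    {"id": "c", "ratings": {"axis_1": 3, "axis_2": 3, "axis_3": 3, "axis_4": 3}},
]

# A permissive threshold keeps everything; a strict one keeps only sample "a".
full_mixture = select_by_rating(toy_samples, min_score=1)
strict_mixture = select_by_rating(toy_samples, min_score=4)

print(len(full_mixture), [s["id"] for s in strict_mixture])
```

Given the ablation result above, a filter like this is probably most useful for controlled experiments on mixture composition rather than as a default preprocessing step.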

Performance and comparative advantages

FineVision was compared to other open datasets such as The Cauldron, LLaVA-OneVision, and Cambrian. Highlights from the paper and experiments:

Training insights

Why this matters for the research community

FineVision addresses a major gap: many top-performing VLMs are trained on proprietary datasets that limit reproducibility. By open-sourcing a large, well-documented, and low-leakage dataset, Hugging Face enables researchers and developers to reproduce results, experiment with new training mixtures, and push forward capabilities in document analysis, visual reasoning, GUI interactions, and other multimodal tasks.

Access and resources

FineVision is available on the Hugging Face Hub and can be loaded via the datasets library. The project also provides technical documentation, a GitHub page with tutorials, code, and notebooks, and community channels for discussion and updates.
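Loading through the datasets library might look like the sketch below. The repository id, subset name, and split name are assumptions that the article does not confirm, so check the dataset card on the Hub before relying on them. Streaming is used so that a quick inspection does not require downloading the full 5 TB.

```python
from itertools import islice

# Assumptions, not confirmed by the article: verify these on the dataset card.
REPO_ID = "HuggingFaceM4/FineVision"
SUBSET = "ai2d_merged"
SPLIT = "train"


def peek_subset(repo_id=REPO_ID, subset=SUBSET, split=SPLIT, n=3):
    """Return the first `n` records of one subset, streamed from the Hub."""
    # Imported lazily; requires `pip install datasets` and network access.
    from datasets import load_dataset

    # streaming=True yields records on demand instead of downloading everything.
    stream = load_dataset(repo_id, name=subset, split=split, streaming=True)
    return list(islice(stream, n))
```

Calling `peek_subset()` and printing the keys of each returned record is a reasonable first step for discovering the actual schema before writing any training code.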