FineVision: Hugging Face Releases 24M-Sample Open Dataset to Supercharge Vision-Language Models
FineVision at a glance
Hugging Face announced FineVision, a fully open multimodal dataset designed for training Vision-Language Models (VLMs). The dataset aggregates over 200 sources into a unified, rigorously filtered format and provides massive scale: 17.3 million images, 24.3 million samples, 88.9 million question-answer turns, and about 9.5 billion answer tokens. FineVision aims to be both large and low-leakage, with just ~1% overlap with common benchmark test sets.
Scale, coverage and new skill domains
FineVision spans 5 TB of curated data across nine categories, including General VQA, OCR QA, Chart & Table reasoning, Science, Captioning, Grounding & Counting, and GUI navigation. Several of these cover emerging skills, such as GUI navigation, pointing, and counting, broadening the set of abilities VLMs can learn beyond classic captioning and VQA.
Key dataset statistics:
- Images: 17.3M
- Samples: 24.3M
- QA turns: 88.9M
- Answer tokens: ~9.5B
- Estimated leakage with benchmarks: ~1%
How FineVision was created
The team used a three-stage curation pipeline:
Collection and augmentation
- Gathered over 200 public image-text datasets.
- Reformatted sources with missing modalities (for example, text-only data) into QA pairs (see the sketch after this list).
- Collected targeted data, such as GUI datasets, to fill coverage gaps.
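As an illustration of how a text-only source could be folded into a multimodal QA format, here is a minimal sketch; the field names (`images`, `texts`, `instruction`, `response`) are assumptions for illustration, not FineVision's actual schema.

```python
# Illustrative only: wrap a hypothetical text-only (instruction, response)
# record as a single chat-style QA turn with no image attached.
def text_record_to_qa(record: dict) -> dict:
    return {
        "images": [],  # text-only sources contribute no image
        "texts": [
            {"user": record["instruction"], "assistant": record["response"]}
        ],
    }

example = {
    "instruction": "Summarize the main idea of photosynthesis.",
    "response": "Plants convert light energy into chemical energy stored as sugars.",
}
print(text_record_to_qa(example))
```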
Cleaning
- Removed QA pairs larger than 8192 tokens.
- Resized images to a maximum of 2048 px while preserving aspect ratio.
- Discarded corrupted or otherwise invalid samples.
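A minimal sketch of these cleaning rules is shown below. The 8192-token and 2048-pixel limits come from the list above; the tokenizer choice and helper names are assumptions for illustration.

```python
from PIL import Image, UnidentifiedImageError
from transformers import AutoTokenizer

# Tokenizer choice is illustrative; any tokenizer with a comparable vocabulary works.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M-Instruct")

MAX_TOKENS = 8192   # drop QA pairs longer than this
MAX_SIDE = 2048     # resize images whose longest side exceeds this

def clean_sample(question: str, answer: str, image_path: str):
    """Return a cleaned (question, answer, image) tuple, or None to drop the sample."""
    # Drop overly long QA pairs.
    if len(tokenizer(question + answer)["input_ids"]) > MAX_TOKENS:
        return None
    # Drop corrupted or unreadable images.
    try:
        image = Image.open(image_path)
        image.load()
    except (OSError, UnidentifiedImageError):
        return None
    # Resize so the longest side is at most 2048 px, preserving aspect ratio.
    width, height = image.size
    if max(width, height) > MAX_SIDE:
        scale = MAX_SIDE / max(width, height)
        image = image.resize((int(width * scale), int(height * scale)), Image.LANCZOS)
    return question, answer, image
```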
Quality rating
Every QA pair was scored by Qwen3-32B and Qwen2.5-VL-32B-Instruct across four axes:
- Text formatting quality
- Question-answer relevance
- Visual dependency
- Image-question correspondence
These ratings enable the construction of selective training mixtures. However, ablation studies reported by the authors indicate that keeping the full dataset, including lower-rated samples, tends to produce the best downstream performance.
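As a sketch of how such ratings could drive a selective mixture, assume each sample carries the four scores as integer fields (the field names and 1-to-5 scale below are assumptions, not FineVision's published schema):

```python
from datasets import Dataset

# Toy samples annotated with the four quality axes on an assumed 1-5 scale.
samples = Dataset.from_list([
    {"question": "What trend does the chart show?", "answer": "Revenue rises each quarter.",
     "formatting": 5, "relevance": 5, "visual_dependency": 4, "image_correspondence": 5},
    {"question": "What is 2 + 2?", "answer": "4",
     "formatting": 5, "relevance": 5, "visual_dependency": 1, "image_correspondence": 1},
])

def keep(sample, min_score=3):
    """Keep samples rated at least min_score on every axis."""
    axes = ("formatting", "relevance", "visual_dependency", "image_correspondence")
    return all(sample[axis] >= min_score for axis in axes)

selective_mixture = samples.filter(keep)
print(len(samples), "->", len(selective_mixture))
```

Given the ablation result above, such thresholds are mainly useful for experimentation; training on the full, unfiltered mixture remains the stronger default.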
Performance and comparative advantages
FineVision was compared to other open datasets such as Cauldron, LLaVA-Vision, and Cambrian. Highlights from the paper and experiments:
- Models trained on FineVision demonstrate large benchmark gains across 11 standard tasks (AI2D, ChartQA, DocVQA, ScienceQA, OCRBench, and others), outperforming LLaVA by up to 46.3%, Cauldron by up to 40.7%, and Cambrian by up to 12.1% on some benchmarks.
- FineVision shows lower dataset leakage after deduplication (~1.02% reported) compared with 2–3% for other datasets.
Training insights
- Experiments used a nanoVLM setup (460M parameters) combining SmolLM2-360M-Instruct as the language backbone with SigLIP2-Base-512 as the vision encoder; a rough architecture sketch follows this list.
- On 32 NVIDIA H100 GPUs, one full epoch (12k steps) takes roughly 20 hours.
- Models trained on FineVision improve steadily, typically overtaking baselines after around 12k steps.
- Multilingual subsets provided modest gains even when the language backbone was primarily monolingual, suggesting that data diversity can outweigh strict language alignment.
- Attempts at multi-stage training (two-stage and 2.5-stage schedules) did not consistently outperform single-stage training on the large, diverse mixture, underscoring that scale and diversity are often more important than complex training schedules.
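The following is a rough sketch of the described pairing of a SigLIP2 vision tower with a SmolLM2 language backbone via a linear projector. It is not the actual nanoVLM code, and the checkpoint ids are assumptions based on the model names above.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoModelForCausalLM

# Checkpoint ids are assumed from the model names in the article.
siglip = AutoModel.from_pretrained("google/siglip2-base-patch16-512")
lm = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-360M-Instruct")

# Map vision features into the language model's embedding space.
projector = nn.Linear(siglip.config.vision_config.hidden_size, lm.config.hidden_size)

def encode_image(pixel_values: torch.Tensor) -> torch.Tensor:
    """Turn pixel values into a sequence of image embeddings the LM can attend over."""
    patch_features = siglip.vision_model(pixel_values=pixel_values).last_hidden_state
    return projector(patch_features)  # (batch, num_patches, lm_hidden_size)
```

In a full VLM, these projected image embeddings are concatenated with the text token embeddings before being fed to the language model.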
Why this matters for the research community
FineVision addresses a major gap: many top-performing VLMs are trained on proprietary datasets that limit reproducibility. By open-sourcing a large, well-documented, and low-leakage dataset, Hugging Face enables researchers and developers to reproduce results, experiment with new training mixtures, and push forward capabilities in document analysis, visual reasoning, GUI interactions, and other multimodal tasks.
Access and resources
FineVision is available on the Hugging Face Hub and can be loaded via the datasets library. The project also provides technical documentation, a GitHub page with tutorials, code, and notebooks, and community channels for discussion and updates.
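For example, assuming the dataset is published under the HuggingFaceM4 organization and exposes one subset per aggregated source (check the Hub page for the exact dataset id and subset names), a split can be streamed like this:

```python
from datasets import load_dataset

# Both identifiers are assumptions; see the FineVision page on the Hub for
# the exact dataset id and the list of available subsets.
DATASET_ID = "HuggingFaceM4/FineVision"
SUBSET = "ai2d"  # hypothetical subset name, one of the aggregated sources

# Streaming avoids downloading the full ~5 TB corpus before inspecting samples.
ds = load_dataset(DATASET_ID, name=SUBSET, split="train", streaming=True)
print(next(iter(ds)))
```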