FineVision: Hugging Face Releases 24M-Sample Open Dataset to Supercharge Vision-Language Models
FineVision at a glance
Hugging Face announced FineVision, a fully open multimodal dataset designed for training Vision-Language Models (VLMs). The dataset aggregates over 200 sources into a unified, rigorously filtered format and provides massive scale: 17.3 million images, 24.3 million samples, 88.9 million question-answer turns, and about 9.5 billion answer tokens. FineVision aims to be both large and low-leakage, with just ~1% overlap with common benchmark test sets.
Scale, coverage and new skill domains
FineVision spans 5 TB of curated data across nine categories, including General VQA, OCR QA, Chart & Table reasoning, Science, Captioning, Grounding & Counting, and GUI navigation. Several of these cover emerging skills, such as GUI navigation, pointing, and counting, broadening the set of abilities VLMs can learn beyond classic captioning and VQA.
Key dataset statistics:
- Images: 17.3M
- Samples: 24.3M
- QA turns: 88.9M
- Answer tokens: ~9.5B
- Estimated leakage with benchmarks: ~1%
How FineVision was created
The team used a three-stage curation pipeline:
Collection and augmentation
- Gathered over 200 public image-text datasets.
- Reformatted sources with missing modalities (for example, text-only data) into QA pairs (see the sketch after this list).
- Collected targeted data, such as GUI datasets, to fill coverage gaps.
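As an illustration of how a text-only source could be folded into a multimodal QA format, here is a minimal sketch; the field names (`images`, `texts`, `instruction`, `response`) are assumptions for illustration, not FineVision's actual schema.

```python
# Illustrative only: wrap a hypothetical text-only (instruction, response)
# record as a single chat-style QA turn with no image attached.
def text_record_to_qa(record: dict) -> dict:
    return {
        "images": [],  # text-only sources contribute no image
        "texts": [
            {"user": record["instruction"], "assistant": record["response"]}
        ],
    }

example = {
    "instruction": "Summarize the main idea of photosynthesis.",
    "response": "Plants convert light energy into chemical energy stored as sugars.",
}
print(text_record_to_qa(example))
```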
Cleaning
- Removed QA pairs larger than 8192 tokens.
- Resized images to a maximum of 2048 px while preserving aspect ratio.
- Discarded corrupted or otherwise invalid samples.
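A minimal sketch of these cleaning rules is shown below. The 8192-token and 2048-pixel limits come from the list above; the tokenizer choice and helper names are assumptions for illustration.

```python
from PIL import Image, UnidentifiedImageError
from transformers import AutoTokenizer

# Tokenizer choice is illustrative; any tokenizer with a comparable vocabulary works.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M-Instruct")

MAX_TOKENS = 8192   # drop QA pairs longer than this
MAX_SIDE = 2048     # resize images whose longest side exceeds this

def clean_sample(question: str, answer: str, image_path: str):
    """Return a cleaned (question, answer, image) tuple, or None to drop the sample."""
    # Drop overly long QA pairs.
    if len(tokenizer(question + answer)["input_ids"]) > MAX_TOKENS:
        return None
    # Drop corrupted or unreadable images.
    try:
        image = Image.open(image_path)
        image.load()
    except (OSError, UnidentifiedImageError):
        return None
    # Resize so the longest side is at most 2048 px, preserving aspect ratio.
    width, height = image.size
    if max(width, height) > MAX_SIDE:
        scale = MAX_SIDE / max(width, height)
        image = image.resize((int(width * scale), int(height * scale)), Image.LANCZOS)
    return question, answer, image
```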
Quality rating
Every QA pair was scored by Qwen3-32B and Qwen2.5-VL-32B-Instruct across four axes:
- Text formatting quality
- Question-answer relevance
- Visual dependency
- Image-question correspondence
These ratings enable the construction of selective training mixtures. However, ablation studies reported by the authors indicate that keeping the full dataset, including lower-rated samples, tends to produce the best downstream performance.
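As a sketch of how such ratings could drive a selective mixture, assume each sample carries the four scores as integer fields (the field names and 1-to-5 scale below are assumptions, not FineVision's published schema):

```python
from datasets import Dataset

# Toy samples annotated with the four quality axes on an assumed 1-5 scale.
samples = Dataset.from_list([
    {"question": "What trend does the chart show?", "answer": "Revenue rises each quarter.",
     "formatting": 5, "relevance": 5, "visual_dependency": 4, "image_correspondence": 5},
    {"question": "What is 2 + 2?", "answer": "4",
     "formatting": 5, "relevance": 5, "visual_dependency": 1, "image_correspondence": 1},
])

def keep(sample, min_score=3):
    """Keep samples rated at least min_score on every axis."""
    axes = ("formatting", "relevance", "visual_dependency", "image_correspondence")
    return all(sample[axis] >= min_score for axis in axes)

selective_mixture = samples.filter(keep)
print(len(samples), "->", len(selective_mixture))
```

Given the ablation result above, such thresholds are mainly useful for experimentation; training on the full, unfiltered mixture remains the stronger default.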
Performance and comparative advantages
FineVision was compared to other open datasets such as Cauldron, LLaVA-Vision, and Cambrian. Highlights from the paper and experiments:
- Models trained on FineVision demonstrate large benchmark gains across 11 standard tasks (AI2D, ChartQA, DocVQA, ScienceQA, OCRBench, and others), outperforming LLaVA by up to 46.3%, Cauldron by up to 40.7%, and Cambrian by up to 12.1% on some benchmarks.
- FineVision shows lower dataset leakage after deduplication (~1.02% reported) compared with 2–3% for other datasets.
Training insights
- Experiments used a nanoVLM setup (460M parameters) combining SmolLM2-360M-Instruct as the language backbone with SigLIP2-Base-512 as the vision encoder; a rough architecture sketch follows this list.
- On 32 NVIDIA H100 GPUs, one full epoch (12k steps) takes roughly 20 hours.
- Models trained on FineVision improve steadily, typically overtaking baselines after around 12k steps.
- Multilingual subsets provided modest gains even when the language backbone was primarily monolingual, suggesting that data diversity can outweigh strict language alignment.
- Attempts at multi-stage training (two-stage and 2.5-stage schedules) did not consistently outperform single-stage training on the large, diverse mixture, underscoring that scale and diversity are often more important than complex training schedules.
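The following is a rough sketch of the described pairing of a SigLIP2 vision tower with a SmolLM2 language backbone via a linear projector. It is not the actual nanoVLM code, and the checkpoint ids are assumptions based on the model names above.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoModelForCausalLM

# Checkpoint ids are assumed from the model names in the article.
siglip = AutoModel.from_pretrained("google/siglip2-base-patch16-512")
lm = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-360M-Instruct")

# Map vision features into the language model's embedding space.
projector = nn.Linear(siglip.config.vision_config.hidden_size, lm.config.hidden_size)

def encode_image(pixel_values: torch.Tensor) -> torch.Tensor:
    """Turn pixel values into a sequence of image embeddings the LM can attend over."""
    patch_features = siglip.vision_model(pixel_values=pixel_values).last_hidden_state
    return projector(patch_features)  # (batch, num_patches, lm_hidden_size)
```

In a full VLM, these projected image embeddings are concatenated with the text token embeddings before being fed to the language model.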
Why this matters for the research community
FineVision addresses a major gap: many top-performing VLMs are trained on proprietary datasets that limit reproducibility. By open-sourcing a large, well-documented, and low-leakage dataset, Hugging Face enables researchers and developers to reproduce results, experiment with new training mixtures, and push forward capabilities in document analysis, visual reasoning, GUI interactions, and other multimodal tasks.
Access and resources
FineVision is available on the Hugging Face Hub and can be loaded via the datasets library. The project also provides technical documentation, a GitHub page with tutorials, code, and notebooks, and community channels for discussion and updates.
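For example, assuming the dataset is published under the HuggingFaceM4 organization and exposes one subset per aggregated source (check the Hub page for the exact dataset id and subset names), a split can be streamed like this:

```python
from datasets import load_dataset

# Both identifiers are assumptions; see the FineVision page on the Hub for
# the exact dataset id and the list of available subsets.
DATASET_ID = "HuggingFaceM4/FineVision"
SUBSET = "ai2d"  # hypothetical subset name, one of the aggregated sources

# Streaming avoids downloading the full ~5 TB corpus before inspecting samples.
ds = load_dataset(DATASET_ID, name=SUBSET, split="train", streaming=True)
print(next(iter(ds)))
```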