IBM Launches Granite-Docling-258M — Compact Open-Source Document AI That Preserves Layout

September 18, 2025 · 3 min

What Granite-Docling-258M does

Granite-Docling-258M is an open-source (Apache-2.0) vision-language model from IBM built for end-to-end document conversion with an emphasis on layout-faithful extraction. Instead of producing lossy Markdown, the model outputs DocTags — a structured, machine-readable representation containing elements, coordinates, and relationships that downstream tools can convert to Markdown, HTML, or JSON.

Improvements over SmolDocling

Granite-Docling is the production-ready successor to SmolDocling-256M. IBM replaced the prior backbone with a Granite 165M language model and upgraded the vision encoder to SigLIP2 (base, patch16-512). The connector between vision and language remains an Idefics3-style pixel-shuffle projector. The final model has 258 million parameters and delivers consistent accuracy gains across layout analysis, full-page OCR, code and equation recognition, and table extraction. IBM also fixed instability failure modes seen in the preview, such as repetitive token loops.

Architecture and training pipeline

Backbone: Idefics3-derived stack using SigLIP2 vision encoder -> pixel-shuffle connector -> Granite 165M LLM.
Training framework: nanoVLM, a lightweight pure-PyTorch VLM training toolkit used for efficient training workflows.
Representation: DocTags, IBM’s markup for unambiguous document structure (elements + coordinates + relationships).
Compute: Trained on IBM’s Blue Vela H100 cluster.

DocTags are designed to preserve complex structures like table topology, inline and floating math, code blocks, captions, and explicit reading order. This richer intermediate representation helps downstream retrieval-augmented generation (RAG) and analytics by maintaining grounding and index quality.

Quantified improvements

IBM evaluated Granite-Docling-258M against the SmolDocling-256M preview using docling-eval, LMMS-Eval, and task-specific datasets. Key gains include:

Layout: MAP 0.27 vs. 0.23; F1 0.86 vs. 0.85.
Full-page OCR: F1 0.84 vs. 0.80; lower edit distance.
Code recognition: F1 0.988 vs. 0.915; edit distance 0.013 vs. 0.114.
Equation recognition: F1 0.968 vs. 0.947.
Table recognition (FinTabNet @150dpi): TEDS-structure 0.97 vs. 0.82; TEDS with content 0.96 vs. 0.76.
Other benchmarks: MMStar 0.30 vs. 0.17; OCRBench 500 vs. 338.
Stability: production-oriented fixes reduce infinite-loop / repetitive-token failure modes.

These metrics show meaningful improvements for table topology, code and equation fidelity, and overall OCR robustness.

Multilingual support

Granite-Docling adds experimental support for Japanese, Arabic, and Chinese. IBM considers these early-stage capabilities; English remains the primary target for the model and its evaluations.

How DocTags change Document AI pipelines

Traditional OCR-to-Markdown flows lose structural detail that downstream retrieval and processing rely on. By emitting DocTags, Granite-Docling preserves the document’s structure and coordinates, enabling more accurate conversions and better grounding for RAG. DocTags allow conversion tools to reconstruct tables, math, code blocks, captions, and reading order without guessing or dropping structural metadata.

Inference, runtimes, and integration

IBM recommends using Docling Integration (CLI/SDK) to automatically pull Granite-Docling and convert PDFs, office documents, and images into multiple formats. Runtimes supported include Transformers, vLLM, ONNX, and MLX. A dedicated MLX build is optimized for Apple Silicon, and a Hugging Face Space provides an interactive demo (ZeroGPU). The model is released under the Apache-2.0 license.

Why this matters for enterprises

For production Document AI, a compact VLM that preserves structure can simplify pipelines and cut inference costs. Granite-Docling consolidates multiple single-purpose components (layout, OCR, table, code, equations) into a single model that outputs a richer intermediate representation, improving downstream conversion fidelity and retrieval quality. The measured gains and improved stability make it a practical upgrade from SmolDocling for enterprise document conversion and RAG workflows.