dots.ocr: 1.7B Vision-Language Model Sets New Standard for Multilingual Document Parsing
dots.ocr is an open-source 1.7B vision-language model that unifies layout detection and OCR to deliver state-of-the-art multilingual document parsing, including accurate table and formula extraction.
dots.ocr is an open-source vision-language transformer built to parse and extract content from documents across more than 100 languages. It merges layout detection and OCR into one flexible model, simplifying pipelines and improving consistency when processing scanned pages, PDFs, and complex documents.
Unified architecture
dots.ocr unifies layout detection and content recognition inside a single transformer-based network. Rather than running separate detection and OCR systems, users switch tasks through prompt adjustments. The model has about 1.7 billion parameters, intended to balance performance and resource requirements for practical deployments.
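To illustrate the prompt-driven design, the sketch below loads the model through a Hugging Face-style interface and selects a task purely by changing the prompt text. The model id, prompt string, and processor calls follow common Qwen2-VL-style conventions and are assumptions here, not the repository's documented templates; consult the project docs for the exact usage.

```python
# Minimal inference sketch, assuming a Qwen2-VL-style Hugging Face interface
# (AutoModelForCausalLM + AutoProcessor with trust_remote_code). The prompt
# below is illustrative -- the repository ships its own prompt templates.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "rednote-hilab/dots.ocr"  # assumed Hugging Face model id

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

image = Image.open("page_001.png")

# Task switching happens through the prompt: the same weights handle layout
# detection, full-page parsing, or plain OCR depending on what you ask for.
prompt = "Parse the document layout and output all elements as JSON."  # illustrative

messages = [{"role": "user",
             "content": [{"type": "image", "image": image},
                         {"type": "text", "text": prompt}]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=4096)

# Strip the prompt tokens before decoding so only the generated answer remains.
generated = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```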
Input flexibility and preprocessing
The model accepts images and PDF files and includes preprocessing options such as fitz_preprocess to improve results on low-resolution scans or dense multi-page documents. Preprocessing helps preserve layout fidelity and reading order prior to extraction.
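The snippet below illustrates the kind of rasterization such a step performs, using PyMuPDF (imported as fitz) to render PDF pages at a chosen resolution before inference. It is a sketch of the idea, not the repository's actual fitz_preprocess implementation.

```python
# Render every page of a PDF to a PNG at a higher DPI so small glyphs and
# dense layouts survive before being fed to the model. Sketch only -- not
# the repository's fitz_preprocess routine.
import fitz  # PyMuPDF

def rasterize_pdf(pdf_path: str, dpi: int = 200) -> list[str]:
    """Render each page of a PDF to a PNG at the requested resolution."""
    zoom = dpi / 72  # PDF user space is 72 points per inch
    matrix = fitz.Matrix(zoom, zoom)
    paths = []
    with fitz.open(pdf_path) as doc:
        for i, page in enumerate(doc):
            pix = page.get_pixmap(matrix=matrix)
            out = f"page_{i:03d}.png"
            pix.save(out)
            paths.append(out)
    return paths
```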
Multilingual and structured extraction
Trained on datasets spanning over 100 languages and varied scripts, dots.ocr extracts plain text, tables, and mathematical formulas in LaTeX, and preserves document structure such as table boundaries and image placements. Output formats include structured JSON for programmatic use, plus Markdown and HTML where appropriate.
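To show how the structured output can be consumed downstream, here is a small sketch that flattens a per-element JSON layout into Markdown. The field names (category, bbox, text) and category labels are illustrative assumptions about the schema; check the repository documentation for the exact keys.

```python
# Sketch of consuming structured output. The keys and category names mirror
# the kind of per-element records the model emits, but treat this schema as
# illustrative rather than authoritative.
import json

def layout_to_markdown(json_path: str) -> str:
    """Flatten a list of layout elements (already in reading order) to Markdown."""
    with open(json_path, encoding="utf-8") as f:
        elements = json.load(f)

    parts = []
    for el in elements:
        category, text = el.get("category", ""), el.get("text", "")
        if category == "Table":
            parts.append(text)               # tables arrive as markup (HTML/Markdown)
        elif category == "Formula":
            parts.append(f"$$\n{text}\n$$")  # formulas arrive as LaTeX
        elif category == "Picture":
            parts.append(f"<!-- image at bbox {el.get('bbox')} -->")
        else:
            parts.append(text)
    return "\n\n".join(parts)
```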
Benchmark results
In head-to-head evaluations against modern document AI systems, dots.ocr shows competitive or superior performance, particularly on tables and text precision. Example benchmark summary:
| Metric | dots.ocr | Gemini2.5-Pro |
|---|---:|---:|
| Table TEDS accuracy (higher is better) | 88.6% | 85.8% |
| Text edit distance (lower is better) | 0.032 | 0.055 |
These results indicate higher table parsing accuracy and lower text edit distance for dots.ocr, while formula recognition and layout reconstruction match or exceed leading models.
Deployment and integration
Released under the MIT license, dots.ocr is fully open-source with code, documentation, and pretrained weights available on GitHub. The repository contains installation instructions for pip, Conda, and Docker. The model supports prompt templates for flexible task configuration and can be used interactively or in automated batch pipelines. Visualization scripts help inspect detected layouts and validate extraction quality.
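For automated use, a batch pipeline can be as simple as walking a directory of PDFs, rasterizing pages, and persisting one JSON result per page, as sketched below. The helper signatures are placeholders for whichever inference path you choose (local transformers, a served endpoint), not part of the project's CLI.

```python
# Batch-pipeline sketch: rasterize each PDF (see the PyMuPDF sketch above),
# run a page-level parse function over the images, and write one JSON file
# per page for downstream validation. The callables are placeholders.
import json
from pathlib import Path
from typing import Callable

def run_batch(pdf_dir: str, out_dir: str,
              rasterize: Callable[[str], list[str]],
              parse_page: Callable[[str], list[dict]]) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        for page_png in rasterize(str(pdf)):
            result = parse_page(page_png)  # list of layout elements
            target = out / f"{pdf.stem}_{Path(page_png).stem}.json"
            target.write_text(json.dumps(result, ensure_ascii=False, indent=2),
                              encoding="utf-8")
```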
Who benefits from dots.ocr
dots.ocr is suited for teams and projects that need robust, language-agnostic document analysis: data extraction from invoices, academic papers, forms, and multilingual archives where preserving structure and reading order is critical. Its single-model approach simplifies deployment in production and resource-constrained environments.
Explore the project and the full documentation on the GitHub repository: https://github.com/rednote-hilab/dots.ocr/blob/master/assets/blog.md