NuMind Unveils NuMarkdown-8B-Thinking: A Reasoning VLM That Turns Scanned Documents into Clean Markdown

A new approach to document OCR

NuMind AI released NuMarkdown-8B-Thinking, an open source MIT-licensed vision-language model that goes beyond simple text extraction. Instead of treating OCR as a straight transcription task, this model reasons about layout, structure, and formatting and produces ready-to-use Markdown output that preserves the original document's organization.

Reasoning-first OCR

NuMarkdown-8B-Thinking uses a reasoning-first pipeline. The model generates internal 'thinking tokens' that represent intermediate layout and structure inferences before producing the final Markdown. These internal steps let the model handle complex real-world formats that often break conventional OCR tools, including:

Multi-column pages with nontrivial reading order
Tables with merged, nested, or irregular cells
Documents mixing images, decorative headers, and watermarks
Historical or degraded scans where layout cues are faint

The number of reasoning tokens scales with complexity, ranging from about 20% to 500% of the final Markdown length, reflecting how much internal inference the model performs before writing the result.

Training and architecture

NuMarkdown-8B-Thinking is a fine-tuned variant of the Qwen 2.5-VL-7B multi-modal model from Alibaba. The training pipeline had two main stages:

Supervised fine-tuning on synthetic document samples. Each sample contained the raw document input, intermediate reasoning steps capturing layout parsing and structure inference, and the final Markdown representation.
Reinforcement learning with GRPO and a layout-centric reward that prioritized faithful reconstruction of formatting and spatial relationships.

This two-stage approach improved the model's ability to reproduce complex layouts with human-level judgment where needed.

How it performs vs other models

Independent evaluations and user testing position NuMarkdown-8B-Thinking as a state-of-the-art reasoning model for OCR-to-Markdown tasks. Highlights from benchmarks and user feedback include:

Beats generalist models like GPT-4o and specialized OCR-focused models like OCRFlux on layout reasoning and structured Markdown output
Competes with large closed-source reasoning models such as Gemini 2.5
Ranks close to elite systems like Gemini Flash Reasoning in blind multi-model comparisons

Users especially praise its ability to infer correct reading order in non-linear layouts, preserve intricate table formatting, and produce parsing-friendly Markdown for Retrieval-Augmented Generation workflows without heavy post-processing.

Example workflow

Consider a scanned annual report page containing multi-level headings, sidebars, multiple columns, a financial table with merged cells, and a legal footer. NuMarkdown-8B-Thinking first emits 'thinking tokens' that describe structure elements like column boundaries, table spans, and footer placement, then outputs Markdown that mirrors both content and layout. This transparent intermediate reasoning also makes the model's decisions auditable, an advantage for enterprise, legal, and archival use cases.

Deployment and licensing

NuMarkdown-8B-Thinking is available on Hugging Face for direct testing and integration. Model weights and quantized GGUF versions are published for local CPU/GPU deployment, and the model is compatible with OpenAI-style APIs and Hugging Face Transformers for quick pipeline integration. The MIT License ensures freedom for commercial, academic, and personal projects with no vendor lock-in.

Why it matters

For sectors that need faithful document digitization like finance, legal, healthcare, and archives, layout fidelity matters as much as textual accuracy. By treating layout inference as an explicit reasoning problem and producing RAG-friendly Markdown, NuMarkdown-8B-Thinking offers an open, verifiable, and high-performance alternative to many proprietary document AI solutions.