NuMind Unveils NuMarkdown-8B-Thinking: A Reasoning VLM That Turns Scanned Documents into Clean Markdown
'NuMind launched NuMarkdown-8B-Thinking, a reasoning-first OCR VLM that infers layout and outputs clean Markdown ideal for RAG and document archiving.'
A new approach to document OCR
NuMind AI released NuMarkdown-8B-Thinking, an open source MIT-licensed vision-language model that goes beyond simple text extraction. Instead of treating OCR as a straight transcription task, this model reasons about layout, structure, and formatting and produces ready-to-use Markdown output that preserves the original document's organization.
Reasoning-first OCR
NuMarkdown-8B-Thinking uses a reasoning-first pipeline. The model generates internal 'thinking tokens' that represent intermediate layout and structure inferences before producing the final Markdown. These internal steps let the model handle complex real-world formats that often break conventional OCR tools, including:
- Multi-column pages with nontrivial reading order
- Tables with merged, nested, or irregular cells
- Documents mixing images, decorative headers, and watermarks
- Historical or degraded scans where layout cues are faint
The number of reasoning tokens scales with complexity, ranging from about 20% to 500% of the final Markdown length, reflecting how much internal inference the model performs before writing the result.
Training and architecture
NuMarkdown-8B-Thinking is a fine-tuned variant of the Qwen 2.5-VL-7B multi-modal model from Alibaba. The training pipeline had two main stages:
- Supervised fine-tuning on synthetic document samples. Each sample contained the raw document input, intermediate reasoning steps capturing layout parsing and structure inference, and the final Markdown representation.
- Reinforcement learning with GRPO and a layout-centric reward that prioritized faithful reconstruction of formatting and spatial relationships.
This two-stage approach improved the model's ability to reproduce complex layouts with human-level judgment where needed.
How it performs vs other models
Independent evaluations and user testing position NuMarkdown-8B-Thinking as a state-of-the-art reasoning model for OCR-to-Markdown tasks. Highlights from benchmarks and user feedback include:
- Beats generalist models like GPT-4o and specialized OCR-focused models like OCRFlux on layout reasoning and structured Markdown output
- Competes with large closed-source reasoning models such as Gemini 2.5
- Ranks close to elite systems like Gemini Flash Reasoning in blind multi-model comparisons
Users especially praise its ability to infer correct reading order in non-linear layouts, preserve intricate table formatting, and produce parsing-friendly Markdown for Retrieval-Augmented Generation workflows without heavy post-processing.
Example workflow
Consider a scanned annual report page containing multi-level headings, sidebars, multiple columns, a financial table with merged cells, and a legal footer. NuMarkdown-8B-Thinking first emits 'thinking tokens' that describe structure elements like column boundaries, table spans, and footer placement, then outputs Markdown that mirrors both content and layout. This transparent intermediate reasoning also makes the model's decisions auditable, an advantage for enterprise, legal, and archival use cases.
Deployment and licensing
NuMarkdown-8B-Thinking is available on Hugging Face for direct testing and integration. Model weights and quantized GGUF versions are published for local CPU/GPU deployment, and the model is compatible with OpenAI-style APIs and Hugging Face Transformers for quick pipeline integration. The MIT License ensures freedom for commercial, academic, and personal projects with no vendor lock-in.
Why it matters
For sectors that need faithful document digitization like finance, legal, healthcare, and archives, layout fidelity matters as much as textual accuracy. By treating layout inference as an explicit reasoning problem and producing RAG-friendly Markdown, NuMarkdown-8B-Thinking offers an open, verifiable, and high-performance alternative to many proprietary document AI solutions.
Сменить язык
Читать эту статью на русском