
Glyph Turns Pages into Tokens: 3–4× Visual Compression to Reach Million-Token Contexts

Glyph converts ultra-long text into page images processed by a VLM to achieve 3–4× effective token compression and roughly 4× faster prefill and decoding on 128K inputs.

What Glyph proposes

Glyph, released by Zhipu AI, reframes long-context modeling by rendering very long textual sequences into page images and letting a vision–language model (VLM) process those images end to end. Each visual token encodes many characters, which shortens the effective token sequence while preserving semantics. The approach targets extreme-context workloads and claims 3–4× effective token compression without degrading accuracy.

Why convert text to images

Traditional techniques for extending context, such as longer positional encodings or modified attention, still scale compute and memory with token count. Retrieval reduces input length but can miss evidence and add latency. Glyph changes the representation itself: it raises information density per token by moving to a visual modality, so a VLM that already learns OCR, layout parsing, and multimodal reasoning can cover more of the original context under the same token budget.
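
To make the representational shift concrete, the snippet below renders a chunk of plain text onto a single page image with Pillow. It is a minimal sketch for illustration only: the page dimensions, font handling, and wrapping logic are assumptions, not Glyph's production renderer.

    from PIL import Image, ImageDraw, ImageFont
    import textwrap

    def render_page(text: str, width: int = 1024, height: int = 1448,
                    font_size: int = 14, margin: int = 32) -> Image.Image:
        """Render a chunk of text onto one page image (illustrative only)."""
        page = Image.new("RGB", (width, height), "white")
        draw = ImageDraw.Draw(page)
        font = ImageFont.load_default()  # swap in a real TTF for legible glyphs
        # Rough character-count wrapping; a real renderer measures glyph widths.
        chars_per_line = (width - 2 * margin) // (font_size // 2)
        y = margin
        for line in textwrap.wrap(text, width=chars_per_line):
            draw.text((margin, y), line, fill="black", font=font)
            y += int(font_size * 1.3)  # line height
            if y > height - margin:
                break  # overflow would start a new page in a full renderer
        return page

Each rendered page is then consumed by the VLM's vision encoder, so a densely set page costs far fewer visual tokens than the text tokens it replaces.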

System design and training pipeline

Glyph's training pipeline has three main stages:

  • Continual pretraining: the VLM is exposed to large corpora of rendered long text with diverse typography and styles. Objectives align visual and textual representations and transfer long-context skills from text tokens to visual tokens.
  • LLM-driven rendering search: a genetic loop driven by an LLM mutates rendering parameters such as page size, DPI, font family and size, line height, alignment, indent, and spacing. Candidates are evaluated on validation sets to jointly optimize accuracy and compression.
  • Post-training: supervised fine-tuning and reinforcement learning with Group Relative Policy Optimization (GRPO), plus an auxiliary OCR alignment task. The OCR loss specifically improves character fidelity when fonts are small and spacing is tight.
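
The auxiliary OCR objective can be pictured as a plain transcription loss on rendered pages. The sketch below is an assumption about its shape, not Glyph's published formulation: it scores decoder logits from a hypothetical VLM forward pass against the original text tokens, with an illustrative weighting into the overall loss.

    import torch
    import torch.nn.functional as F

    def ocr_alignment_loss(logits: torch.Tensor, target_ids: torch.Tensor,
                           pad_id: int = 0) -> torch.Tensor:
        """Token-level cross-entropy for transcribing a rendered page.

        logits:     (batch, seq, vocab) decoder predictions from the VLM
        target_ids: (batch, seq) token ids of the original source text
        """
        return F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            target_ids.reshape(-1),
            ignore_index=pad_id,  # skip padding positions
        )

    def combined_loss(task_loss: torch.Tensor, ocr_loss: torch.Tensor,
                      ocr_weight: float = 0.1) -> torch.Tensor:
        # ocr_weight is illustrative; the actual weighting is not specified here.
        return task_loss + ocr_weight * ocr_loss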

The rendering search is an automated way to find typography and layout settings that balance compression and readability for OCR-like VLM processing.
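
A minimal sketch of such a search loop is shown below. The mutation step here is random for brevity, whereas Glyph uses an LLM to propose rendering changes; the parameter names, ranges, and fitness weighting are all illustrative assumptions.

    import random
    from dataclasses import dataclass, asdict

    @dataclass
    class RenderConfig:
        dpi: int = 96
        font_size: int = 12        # points
        line_height: float = 1.2
        alignment: str = "left"

    def propose_mutation(cfg: RenderConfig) -> RenderConfig:
        """Stand-in for the LLM-driven mutation step: perturb rendering parameters."""
        child = RenderConfig(**asdict(cfg))
        child.dpi = random.choice([72, 96, 120])
        child.font_size = max(6, cfg.font_size + random.choice([-2, -1, 1, 2]))
        return child

    def fitness(accuracy: float, compression: float, alpha: float = 0.5) -> float:
        """Joint objective trading off task accuracy against token compression."""
        return alpha * accuracy + (1 - alpha) * compression

    def search(evaluate, generations: int = 20, population: int = 8) -> RenderConfig:
        """evaluate(cfg) -> (accuracy, compression) measured on a validation set."""
        best, best_score = RenderConfig(), float("-inf")
        for _ in range(generations):
            for _ in range(population):
                cand = propose_mutation(best)
                acc, comp = evaluate(cand)
                score = fitness(acc, comp)
                if score > best_score:
                    best, best_score = cand, score
        return best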

Evaluation, performance and efficiency

Glyph was evaluated on long-context benchmarks including LongBench, MRCR, and Ruler. Reported results include:

  • Average effective compression about 3.3× on LongBench (with some tasks approaching 5×) and about 3.0× on MRCR.
  • Prefill speedups of about 4.8×, decoding speedups around 4.4×, and supervised fine-tuning throughput improvements of roughly 2× compared to a text backbone at 128K inputs.
  • DPI trade-offs: DPI 72 produced an average compression of 4.0× and a maximum of 7.7× on specific subtasks; DPI 96 yielded an average of 2.2× and a maximum of 4.4×; DPI 120 gave an average of 1.2× and a maximum of 2.8×. Higher DPI at inference tends to improve scores because crisper glyphs help OCR and layout parsing.

An extreme result shows a 128K-context VLM addressing tasks that originate from roughly 1M-token inputs under aggressive visual compression.
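
A quick back-of-the-envelope check connects those ratios to the million-token claim: reading "effective compression" as original text tokens per visual token, a fixed visual-token budget covers proportionally more source text as the ratio rises.

    def text_tokens_covered(visual_token_budget: int, compression: float) -> int:
        """Original text tokens representable within a visual-token budget."""
        return int(visual_token_budget * compression)

    budget = 128_000  # the VLM's context window, spent on visual tokens
    for dpi, avg, peak in [(72, 4.0, 7.7), (96, 2.2, 4.4), (120, 1.2, 2.8)]:
        print(f"DPI {dpi}: avg ~{text_tokens_covered(budget, avg):,} tokens, "
              f"peak ~{text_tokens_covered(budget, peak):,} tokens")
    # At DPI 72's peak ratio, 128K visual tokens span roughly 985K source tokens,
    # which is how a 128K-context VLM reaches inputs of about one million tokens.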

Applications and limitations

Glyph is especially useful for multimodal document understanding and for long dialogue or document tasks where preserving a large context is critical. Training on rendered pages improves performance on document benchmarks compared to base visual models, suggesting that rendered-text training is a strong pretext task for real document workloads involving figures and layout.

Key limitations include sensitivity to aggressive typography: very small fonts and tight spacing degrade character accuracy, particularly for rare alphanumeric strings. The approach assumes server-side rendering and a VLM with solid OCR and layout priors; some subtasks (for example, UUID detection) were excluded when character fidelity was insufficient.

Key takeaways

Glyph reframes long-context scaling as a multimodal visual-text compression problem: render long text into images, process with a VLM, and achieve substantial token compression while retaining semantics. The team reports 3–4× token compression with accuracy comparable to strong 8B text baselines on long-context benchmarks, plus notable speed and memory gains. Code, model cards, and weights are available on GitHub and Hugging Face, and the paper is published on arXiv: https://arxiv.org/pdf/2510.17800
