Weihan Wang
2026
Glyph: Scaling Context Windows via Visual-Text Compression
Jiale Cheng | Yusen Liu | Xinyu Zhang | Yulin Fei | Wenyi Hong | Ruiliang Lyu | Weihan Wang | Zhe Su | Xiaotao Gu | Xiao Liu | Yushi Bai | Jie Tang | Hongning Wang | Minlie Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) conventionally represent text as sequences of discrete tokens, making long-context scaling largely a matter of processing more tokens more efficiently. We instead explore a complementary direction: increasing how much original context each token represents. To this end, we introduce Glyph, a framework that renders long texts into compact visual pages and processes them with a vision-language model (VLM), allowing a fixed context window to cover substantially more text. To make visual compression practical, Glyph combines continual pre-training on rendered long-text data, an LLM-driven genetic search to identify rendering configurations that balance compression and task performance, and post-training with supervised fine-tuning and reinforcement learning. Across multiple long-context benchmarks, Glyph achieves 3–4× token compression while maintaining performance comparable to strong text-only LLMs such as Qwen3-8B, with over 4× faster prefilling and decoding and 2× faster supervised fine-tuning. Under more aggressive compression, a VLM with a 128K context window can handle tasks that would otherwise require up to 1M input tokens. Our code and model are released at https://github.com/thu-coai/Glyph.
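
As a rough illustration of the arithmetic behind these figures, the sketch below assumes the compression ratio means the number of original text tokens divided by the number of visual tokens the VLM consumes, and reads "128K" and "1M" as 131,072 and 1,048,576 tokens. The function names are hypothetical and are not taken from the Glyph codebase.

```python
def effective_text_tokens(window_tokens: int, compression_ratio: float) -> int:
    """Text tokens a window of visual tokens can cover.

    Assumes compression_ratio = text tokens / visual tokens.
    """
    if window_tokens <= 0 or compression_ratio <= 0:
        raise ValueError("window_tokens and compression_ratio must be positive")
    return int(window_tokens * compression_ratio)


def required_ratio(target_text_tokens: int, window_tokens: int) -> float:
    """Compression ratio needed to fit target_text_tokens into the window."""
    if target_text_tokens <= 0 or window_tokens <= 0:
        raise ValueError("token counts must be positive")
    return target_text_tokens / window_tokens


WINDOW = 128 * 1024          # assumption: "128K" read as 131,072 tokens
TARGET = 1024 * 1024         # assumption: "1M" read as 1,048,576 tokens

if __name__ == "__main__":
    # Coverage at the 3-4x compression range quoted in the abstract.
    for ratio in (3.0, 4.0):
        print(f"{ratio:.0f}x -> {effective_text_tokens(WINDOW, ratio):,} text tokens")
    # Ratio implied by covering a 1M-token input with a 128K window.
    print(f"required ratio for 1M: {required_ratio(TARGET, WINDOW):.1f}x")
```

Under these assumptions, a 3–4× ratio lets a 128K window cover roughly 390K to 520K text tokens, while a 1M-token input implies a ratio of about 8×, consistent with the abstract's description of that setting as more aggressive compression.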