Zixuan Lan


2025

pdf bib
Text or Pixels? Evaluating Efficiency and Understanding of LLMs with Visual Text Inputs
Yanhong Li | Zixuan Lan | Jiawei Zhou
Findings of the Association for Computational Linguistics: EMNLP 2025

Large language models (LLMs) and their multimodal variants can now process visual inputs, including images of text. This raises an intriguing question: Can we compress textual inputs by feeding them as images to reduce token usage while preserving performance?In this paper, we show that *visual text representations* are a practical and surprisingly effective form of input compression for decoder LLMs. We exploit this idea by rendering long text inputs as a single image and providing it directly to the model. This approach dramatically reduces the number of decoder tokens required, offering a new form of input compression. Through experiments on two distinct benchmarks — RULER (long-context retrieval) and CNN/DailyMail (document summarization) — we demonstrate that this text-as-image method yields substantial token savings *without degrading task performance*.