Jijun Chi


2025

STRICT: Stress-Test of Rendering Images Containing Text
Tianyu Zhang | Xinyu Wang | Lu Li | Zhenghan Tai | Jijun Chi | Jingrui Tian | Hailin He | Suyuchen Wang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

While diffusion models have revolutionized text-to-image generation with their ability to synthesize realistic and diverse scenes, they continue to struggle with generating consistent and legible text within images. This shortcoming is commonly attributed to the locality bias inherent in diffusion-based generation, which limits these models' capacity to capture long-range spatial dependencies. In this paper, we introduce STRICT, a benchmark designed to systematically stress-test the ability of diffusion models to render coherent and instruction-aligned text in images. Our benchmark evaluates models along two dimensions: (1) the maximum length of readable text that can be generated, and (2) the correctness and legibility of the generated text. We assess several state-of-the-art models, including proprietary and open-source variants, and reveal persistent limitations in long-range consistency and instruction-following capabilities. Our findings provide insights into architectural bottlenecks and motivate future research directions in multimodal generative modeling.
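The abstract does not specify how correctness and legibility are scored; a common approach for this kind of benchmark is to OCR the generated image and compare the recovered string to the prompt text. Below is a minimal, hypothetical sketch of such a scoring function (not the paper's official metric), assuming `pytesseract` and Pillow are available; the function name `render_score` and the similarity measure are illustrative choices.

```python
# Hypothetical OCR-based scoring sketch; not STRICT's official metric.
# Assumes pytesseract (with a Tesseract install) and Pillow are available.
from difflib import SequenceMatcher

from PIL import Image
import pytesseract


def render_score(image_path: str, target_text: str) -> float:
    """Return a 0-1 similarity between the target text and OCR output.

    A higher score suggests the image contains more of the requested
    text, spelled correctly and legibly enough for OCR to recover it.
    """
    recovered = pytesseract.image_to_string(Image.open(image_path))

    def norm(s: str) -> str:
        # Normalize case and whitespace so the score reflects content,
        # not line breaks or layout.
        return " ".join(s.lower().split())

    return SequenceMatcher(None, norm(target_text), norm(recovered)).ratio()


if __name__ == "__main__":
    # Score one generated sample against the text it was asked to render.
    print(render_score("sample.png", "The quick brown fox jumps over the lazy dog"))
```

Sweeping `target_text` length upward while tracking where this score collapses would give a rough probe of the first dimension (maximum readable text length); OCR noise makes such a proxy imperfect, which is one reason dedicated stress-test benchmarks like STRICT are useful.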