STRICT: Stress-Test of Rendering Image Containing Text

Tianyu Zhang; Xinyu Wang; Lu Li; Zhenghan Tai; Jijun Chi; Jingrui Tian; Hailin He; Suyuchen Wang

Abstract

While diffusion models have revolutionized text-to-image generation with their ability to synthesize realistic and diverse scenes, they continue to struggle with generating consistent and legible text within images. This shortcoming is commonly attributed to the locality bias inherent in diffusion-based generation, which limits their capacity to model long-range spatial dependencies. In this paper, we introduce STRICT, a benchmark designed to systematically stress-test the ability of diffusion models to render coherent and instruction-aligned text in images. Our benchmark evaluates models across multiple dimensions: (1) the maximum length of readable text that can be generated and (2) the correctness and legibility of the generated text. We assess several state-of-the-art models, including proprietary and open-source variants, and reveal persistent limitations in long-range consistency and instruction-following capabilities. Our findings provide insights into architectural bottlenecks and motivate future research directions in multimodal generative modeling.

Anthology ID: 2025.emnlp-main.1070
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 21148–21161
URL: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1070/
Cite (ACL): Tianyu Zhang, Xinyu Wang, Lu Li, Zhenghan Tai, Jijun Chi, Jingrui Tian, Hailin He, and Suyuchen Wang. 2025. STRICT: Stress-Test of Rendering Image Containing Text. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 21148–21161, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): STRICT: Stress-Test of Rendering Image Containing Text (Zhang et al., EMNLP 2025)
PDF: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1070.pdf
Checklist: 2025.emnlp-main.1070.checklist.pdf
