Timestep Embeddings Trigger Collapse in Diffusion Text Generation

Ryota Nosaka, Takuya Matsuzaki


Abstract
Diffusion models, which work by iteratively refining random noise into realistic data, have achieved remarkable success in various generative tasks, particularly in image and audio synthesis. Recent studies have highlighted their potential for text generation, but several challenges remain unresolved. One significant issue is that, after a certain timestep in the generation process, the model begins to degrade its previous sample rather than improve it, resulting in broken text. In this paper, we reveal that timestep embeddings are a principal cause of this collapse problem by analyzing their interactions with word embeddings. Further, we propose two key methods: (a) a simple, lightweight word embedding technique that improves both model analyzability and learning efficiency; (b) a novel regularization on both word and timestep embeddings. Experimental results demonstrate that our approach effectively mitigates the collapse problem and considerably improves the quality of generated text.
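The abstract does not spell out how timestep embeddings enter the model. Below is a minimal sketch of the standard setup in diffusion language models, where a sinusoidal timestep embedding is added to noisy word embeddings before denoising; all names, shapes, and the additive conditioning are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sinusoidal_timestep_embedding(t: int, dim: int) -> np.ndarray:
    """Standard sinusoidal embedding for a scalar diffusion timestep t."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

# Hypothetical setup: the denoiser conditions each noisy token vector on the
# current timestep by adding the timestep embedding (broadcast over positions).
rng = np.random.default_rng(0)
dim, seq_len, t = 64, 8, 500
noisy_word_embs = rng.normal(size=(seq_len, dim))  # z_t: noisy token vectors
t_emb = sinusoidal_timestep_embedding(t, dim)

# The paper's analysis concerns this interaction: because the two embeddings
# are simply summed, the timestep component can interfere with the word
# component, which is the route to collapse the paper investigates.
denoiser_input = noisy_word_embs + t_emb
print(denoiser_input.shape)  # (8, 64)
```

Under this (assumed) additive scheme, regularizing the word and timestep embeddings, as the paper proposes, would constrain how much one component can dominate the other.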
Anthology ID:
2025.conll-1.26
Volume:
Proceedings of the 29th Conference on Computational Natural Language Learning
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Gemma Boleda, Michael Roth
Venues:
CoNLL | WS
Publisher:
Association for Computational Linguistics
Pages:
397–406
URL:
https://preview.aclanthology.org/landing_page/2025.conll-1.26/
DOI:
10.18653/v1/2025.conll-1.26
Cite (ACL):
Ryota Nosaka and Takuya Matsuzaki. 2025. Timestep Embeddings Trigger Collapse in Diffusion Text Generation. In Proceedings of the 29th Conference on Computational Natural Language Learning, pages 397–406, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Timestep Embeddings Trigger Collapse in Diffusion Text Generation (Nosaka & Matsuzaki, CoNLL 2025)
PDF:
https://preview.aclanthology.org/landing_page/2025.conll-1.26.pdf