Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation
Yuhui Zhang, Brandon McKinzie, Zhe Gan, Vaishaal Shankar, Alexander T Toshev
Abstract
Recent advances in image tokenizers, such as VQ-VAE, have enabled text-to-image generation using auto-regressive methods, similar to language modeling. However, these methods have yet to leverage pre-trained language models, despite their adaptability to various downstream tasks. In this work, we explore this gap by adapting a pre-trained language model for auto-regressive text-to-image generation, and find that pre-trained language models offer limited help. We provide a two-fold explanation by analyzing tokens from each modality. First, we demonstrate that image tokens possess significantly different semantics compared to text tokens, rendering pre-trained language models no more effective in modeling them than randomly initialized ones. Second, the text tokens in image-text datasets are too simple compared to typical language model pre-training data, which causes catastrophic degradation of the language model's capabilities.
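To make the setup under study concrete, below is a minimal sketch (not the authors' code) of auto-regressive text-to-image generation with a single decoder-only LM: the vocabulary is the union of text tokens and VQ-VAE image-codebook tokens (image ids offset past the text range), and the model is trained with next-token prediction on concatenated [caption tokens ; image tokens] sequences. All sizes, the model architecture, and variable names are illustrative assumptions.

```python
# Minimal sketch of auto-regressive text-to-image modeling with a shared
# text + VQ-VAE-codebook vocabulary. Sizes below are assumed for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB = 32000   # assumed size of the (pre-trained) LM's text vocabulary
IMAGE_VOCAB = 8192   # assumed VQ-VAE codebook size
D_MODEL = 512

class TextToImageLM(nn.Module):
    def __init__(self):
        super().__init__()
        # Shared embedding table: text ids in [0, TEXT_VOCAB),
        # image-code ids in [TEXT_VOCAB, TEXT_VOCAB + IMAGE_VOCAB).
        self.embed = nn.Embedding(TEXT_VOCAB + IMAGE_VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(D_MODEL, TEXT_VOCAB + IMAGE_VOCAB)

    def forward(self, tokens):
        # A causal mask makes this encoder stack behave as a decoder-only
        # (auto-regressive) language model over the combined vocabulary.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.blocks(self.embed(tokens), mask=mask.to(tokens.device))
        return self.head(h)

# One training step on a caption / image-token pair (random stand-in data).
text_ids = torch.randint(0, TEXT_VOCAB, (1, 16))                  # tokenized caption
image_ids = torch.randint(0, IMAGE_VOCAB, (1, 256)) + TEXT_VOCAB  # 16x16 VQ codes, offset
seq = torch.cat([text_ids, image_ids], dim=1)

model = TextToImageLM()
logits = model(seq[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), seq[:, 1:].reshape(-1))
loss.backward()
```

In this framing, initializing the transformer from a pre-trained LM checkpoint versus from scratch changes only the starting weights, which is exactly the comparison the paper makes; the image-token embeddings are new in either case.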
- Anthology ID: 2024.emnlp-main.75
- Volume: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
- Month: November
- Year: 2024
- Address: Miami, Florida, USA
- Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
- Venue: EMNLP
- Publisher: Association for Computational Linguistics
- Pages: 1281–1287
- URL: https://preview.aclanthology.org/moar-dois/2024.emnlp-main.75/
- DOI: 10.18653/v1/2024.emnlp-main.75
- Cite (ACL): Yuhui Zhang, Brandon McKinzie, Zhe Gan, Vaishaal Shankar, and Alexander T Toshev. 2024. Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1281–1287, Miami, Florida, USA. Association for Computational Linguistics.
- Cite (Informal): Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation (Zhang et al., EMNLP 2024)
- PDF: https://preview.aclanthology.org/moar-dois/2024.emnlp-main.75.pdf