Language Model Based Text-to-Audio Generation: Anti-Causally Aligned Collaborative Residual Transformers

Juncheng Wang, Chao Xu, Cheng Yu, Zhe Hu, Haoyu Xie, Guoqi Yu, Lei Shang, Shujun Wang


Abstract
While language models (LMs) paired with residual vector quantization (RVQ) tokenizers have shown promise in text-to-audio (T2A) generation, they still lag behind diffusion-based models by a non-trivial margin. We identify a critical dilemma underpinning this gap: incorporating more RVQ layers improves audio reconstruction fidelity but exceeds the generation capacity of conventional LMs. To address this, we first analyze RVQ dynamics and uncover two key limitations: 1) orthogonality of features across RVQ layers hinders effective LM training, and 2) diminishing semantic richness in tokens from deeper RVQ layers exacerbates exposure bias during autoregressive decoding. Based on these insights, we propose Siren, a novel LM-based framework that employs multiple isolated transformers with causal conditioning and anti-causal alignment via reinforcement learning. Extensive experiments demonstrate that Siren outperforms both existing LM-based and diffusion-based T2A systems, achieving state-of-the-art results. By bridging the representational strengths of LMs with the fidelity demands of audio synthesis, our approach repositions LMs as competitive contenders against diffusion models in T2A tasks. Moreover, by aligning audio representations with linguistic structures, Siren opens a promising pathway toward unified multi-modal generation frameworks.
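The dilemma the abstract describes hinges on how residual vector quantization works: each layer quantizes the residual left over by the layers before it, so adding layers lowers reconstruction error while deeper layers carry progressively finer, less semantically structured detail. The following toy sketch illustrates that standard RVQ encoding loop; the codebook sizes, dimensions, and random codebooks are illustrative placeholders, not the paper's actual tokenizer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy RVQ setup: a stack of small random codebooks.
# (Sizes are illustrative only; real audio tokenizers use learned
# codebooks with far more entries and dimensions.)
num_layers, codebook_size, dim = 4, 16, 8
codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(num_layers)]

def rvq_encode(x, codebooks):
    """Quantize x layer by layer: each codebook encodes the residual
    that the previous layers failed to capture. Returns the per-layer
    code indices and the cumulative reconstruction."""
    recon = np.zeros_like(x)
    codes = []
    for cb in codebooks:
        residual = x - recon
        # Pick the codeword nearest to the current residual.
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        recon = recon + cb[idx]
    return codes, recon

x = rng.normal(size=dim)
codes, recon = rvq_encode(x, codebooks)
print(codes, np.linalg.norm(x - recon))
```

One token index per layer per frame is exactly what the LM must then generate autoregressively, which is why deeper stacks (better fidelity) make the generation problem harder.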
Anthology ID:
2025.emnlp-main.1322
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
26036–26054
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1322/
Cite (ACL):
Juncheng Wang, Chao Xu, Cheng Yu, Zhe Hu, Haoyu Xie, Guoqi Yu, Lei Shang, and Shujun Wang. 2025. Language Model Based Text-to-Audio Generation: Anti-Causally Aligned Collaborative Residual Transformers. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 26036–26054, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Language Model Based Text-to-Audio Generation: Anti-Causally Aligned Collaborative Residual Transformers (Wang et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1322.pdf
Checklist:
 2025.emnlp-main.1322.checklist.pdf