Masked Text-to-Audio Flow-Matching and Reward Feedback Optimization

Rongjie Huang; Dongchao Yang; Wenxiang Guo; Huadai Liu; Xize Cheng; Zehan Wang; Zhou Zhao; Xixin Wu; Helen Meng

Abstract

Flow-matching generative models have reached significant milestones in text-to-audio generation, powered by scalable training with more data, computational resources, and model parameters, while their scalable inference remains less explored. In this work, we propose MaskAudioFlow, a continuous flow-matching transformer with masked generative modeling, designed to scale inference-time computation for text-to-audio generation. Specifically, MaskAudioFlow 1) masks spans of audio frames during training and approximates the continuous velocity vector field with a flow-matching objective, and 2) performs inference via masked prediction, where we mask out parts of the generated audio and re-predict them through iterative decoding. To reduce the gap between generated audio and human preferences, we fine-tune MaskAudioFlow using reward signals for text-audio correspondence and perceptual aesthetics. Experimental results demonstrate that MaskAudioFlow achieves state-of-the-art performance in text-to-audio generation and scales inference-time computation effectively through iterative masked prediction. Moreover, the preference-tuned model shows more faithful text-audio alignment and improved perceptual aesthetics. Audio samples are available at https://MaskAudio.github.io
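The two steps named in the abstract (span-masked flow-matching training and iterative masked re-prediction) can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the linear interpolation path, the choice to show unmasked frames as clean context, the Euler sampler, and the hypothetical names `span_mask`, `masked_flow_matching_loss`, `euler_fill`, `iterative_decode`, and `velocity_fn` (a stand-in for the transformer, called as `velocity_fn(x, t, mask)`). Text conditioning and reward fine-tuning are omitted.

```python
import numpy as np

def span_mask(num_frames, span_len, mask_ratio, rng):
    """Mark contiguous spans True until at least mask_ratio of frames are covered.
    Assumes span_len <= num_frames."""
    mask = np.zeros(num_frames, dtype=bool)
    target = int(round(mask_ratio * num_frames))
    while mask.sum() < target:
        start = rng.integers(0, num_frames - span_len + 1)
        mask[start:start + span_len] = True
    return mask

def masked_flow_matching_loss(velocity_fn, x1, mask, rng):
    """Flow-matching regression loss on masked frames only.
    Assumed path: x_t = (1 - t) * x0 + t * x1, target velocity x1 - x0.
    Assumes the mask contains at least one True entry."""
    x0 = rng.standard_normal(x1.shape)           # noise sample
    t = rng.uniform()                            # time in [0, 1)
    x_t = (1.0 - t) * x0 + t * x1
    # Assumption: unmasked frames are given to the model clean, as context.
    model_in = np.where(mask[:, None], x_t, x1)
    pred = velocity_fn(model_in, t, mask)
    err = (pred - (x1 - x0)) ** 2
    return err[mask].mean()

def euler_fill(velocity_fn, x, mask, rng, steps=8):
    """Regenerate masked frames by Euler integration from noise; keep the rest."""
    x = np.where(mask[:, None], rng.standard_normal(x.shape), x)
    for i in range(steps):
        t = i / steps
        v = velocity_fn(x, t, mask)
        x = np.where(mask[:, None], x + v / steps, x)
    return x

def iterative_decode(velocity_fn, num_frames, dim, rng,
                     rounds=3, span_len=4, mask_ratio=0.5):
    """Generate everything once, then repeatedly mask spans and re-predict them."""
    full = np.ones(num_frames, dtype=bool)
    x = euler_fill(velocity_fn, np.zeros((num_frames, dim)), full, rng)
    for _ in range(rounds):
        mask = span_mask(num_frames, span_len, mask_ratio, rng)
        x = euler_fill(velocity_fn, x, mask, rng)
    return x
```

In this sketch, `rounds` is the knob that spends more inference-time computation: each extra round re-masks part of the current output and runs the sampler again on those frames. How masks are actually chosen at inference (randomly, as assumed here, or by some confidence criterion) is not specified in the abstract.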

Anthology ID: 2026.findings-acl.1891
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 37953–37966
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1891/
Cite (ACL): Rongjie Huang, Dongchao Yang, Wenxiang Guo, Huadai Liu, Xize Cheng, Zehan Wang, Zhou Zhao, Xixin Wu, and Helen M. Meng. 2026. Masked Text-to-Audio Flow-Matching and Reward Feedback Optimization. In Findings of the Association for Computational Linguistics: ACL 2026, pages 37953–37966, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Masked Text-to-Audio Flow-Matching and Reward Feedback Optimization (Huang et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1891.pdf
Checklist: 2026.findings-acl.1891.checklist.pdf
