In-the-wild Audio Spatialization with Flexible Text-guided Localization

Tianrui Pan; Jie Liu; Zewen Huang; Jie Tang; Gangshan Wu

Abstract

Binaural audio enriches immersive experiences by enabling the perception of the spatial locations of sounding objects in AR, VR, and embodied AI applications. While existing audio spatialization methods can generally map any available monaural audio to binaural audio signals, they often lack the flexible and interactive control needed in complex multi-object, user-interactive environments. To address this, we propose a Text-guided Audio Spatialization (TAS) framework that is driven by diverse text prompts, and we evaluate our model from unified generation and comprehension perspectives. Due to the limited availability of high-quality, large-scale stereo data, we construct the SpatialTAS dataset, which encompasses 376,000 simulated binaural audio samples to facilitate the training of our model. Our model learns binaural differences guided by 3D spatial location and relative position prompts, enhanced with flipped-channel audio. Experimental results show that our model can generate high-quality binaural audio for various audio types on both simulated and real-recorded datasets. In addition, we establish an assessment model based on Llama-3.1-8B, which evaluates the semantic accuracy of spatial locations through a spatial reasoning task. Results demonstrate that by utilizing text prompts for flexible and interactive control, we can generate binaural audio with both high quality and semantic consistency in spatial locations.
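
The abstract mentions training that is "enhanced with flipped-channel audio" but does not say how the flipping is done. One plausible reading is a left/right mirroring augmentation: swap the two channels of a binaural clip and mirror the relative position words in its prompt so that audio and text stay consistent. The sketch below illustrates that reading only; the function names (`flip_channels`, `mirror_prompt`, `augment`), the list-based audio representation, and the prompt wording are all assumptions made for illustration, not the authors' implementation.

```python
import re

# Hypothetical mapping used to mirror relative-position words in a prompt.
_SWAP = {"left": "right", "right": "left"}


def flip_channels(left, right):
    """Swap the two channels of a binaural clip (each a list of samples)."""
    return list(right), list(left)


def mirror_prompt(prompt):
    """Swap the words 'left' and 'right', keeping a leading capital letter.

    Caveat: this naive rule also rewrites 'right' when it means 'correct',
    so a real pipeline would need a stricter prompt template.
    """
    def repl(match):
        word = match.group(0)
        swapped = _SWAP[word.lower()]
        return swapped.capitalize() if word[0].isupper() else swapped

    return re.sub(r"\b(left|right)\b", repl, prompt, flags=re.IGNORECASE)


def augment(left, right, prompt):
    """Return a mirrored (left, right, prompt) training example."""
    new_left, new_right = flip_channels(left, right)
    return new_left, new_right, mirror_prompt(prompt)
```

For example, `augment([0.9, 0.8], [0.1, 0.2], "The dog barks on the left of the listener.")` yields a clip whose louder channel is now the right one, paired with a prompt that says "right" instead of "left".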

Anthology ID: 2025.acl-long.98
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 1989–2001
URL: https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.98/
Cite (ACL): Tianrui Pan, Jie Liu, Zewen Huang, Jie Tang, and Gangshan Wu. 2025. In-the-wild Audio Spatialization with Flexible Text-guided Localization. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1989–2001, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): In-the-wild Audio Spatialization with Flexible Text-guided Localization (Pan et al., ACL 2025)
PDF: https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.98.pdf
