Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion

Shunian Chen; Xinyuan Xie; Zheshu Chen; Owen Lee; Liyan Zhao; Zhan Su; Qilin Sun; Benyou Wang

Abstract

High-quality, large-scale audio captioning is crucial for advancing audio understanding, yet current automated methods often generate captions that lack fine-grained detail and contextual accuracy, primarily due to their reliance on limited unimodal or superficial multimodal information. Drawing inspiration from human auditory perception, which adeptly integrates cross-modal cues and performs sophisticated auditory scene analysis, we introduce a novel two-stage automated pipeline. This pipeline first employs specialized pretrained models to extract diverse contextual cues (e.g., speech, music, general sounds, and visual information from associated video). A large language model (LLM) then synthesizes these rich, multimodal inputs to generate detailed and context-aware audio captions. Key contributions of this work include: (1) the proposed scalable method for fine-grained audio caption generation; (2) FusionAudio, a new large-scale dataset comprising 1.2 million such detailed captions, combined with 6 million QA pairs; and (3) enhanced audio models developed using FusionAudio, specifically a CLAP-based audio encoder with superior audio-text alignment and instruction following. This paper paves the way for more nuanced and accurate automated understanding of complex audio environments.
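
The two-stage pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `ContextCues`, `extract_cues`, `build_prompt`, and `caption_clip`, the prompt wording, and the idea of passing extractors and the LLM as plain callables are all hypothetical. The abstract does not say which pretrained models or which LLM are used, so they are left as placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical interface: each extractor maps a clip path to a text description.
Extractor = Callable[[str], str]


@dataclass
class ContextCues:
    """Textual cues for one clip, one field per cue type named in the abstract."""
    speech: str
    music: str
    sound: str
    visual: str


def extract_cues(clip_path: str, extractors: Dict[str, Extractor]) -> ContextCues:
    """Stage 1: run one specialized model per cue type.

    `extractors` must have exactly the keys speech, music, sound, visual.
    """
    return ContextCues(**{name: fn(clip_path) for name, fn in extractors.items()})


def build_prompt(cues: ContextCues) -> str:
    """Combine the cues into a single LLM prompt (wording is an assumption)."""
    lines = [
        "Write one detailed, context-aware caption for the audio clip.",
        f"Speech: {cues.speech}",
        f"Music: {cues.music}",
        f"General sounds: {cues.sound}",
        f"Visual context: {cues.visual}",
    ]
    return "\n".join(lines)


def caption_clip(
    clip_path: str,
    extractors: Dict[str, Extractor],
    llm: Callable[[str], str],
) -> str:
    """Stage 2: let the LLM synthesize the cues into a caption."""
    cues = extract_cues(clip_path, extractors)
    return llm(build_prompt(cues))
```

Keeping the extractors and the LLM behind simple callables means any speech, music, sound, or video model could be swapped in without changing the fusion step.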

Anthology ID: 2026.acl-long.1285
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 27888–27913
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1285/
Cite (ACL): Shunian Chen, Xinyuan Xie, Zheshu Chen, Owen Lee, Liyan Zhao, Zhan Su, Qilin Sun, and Benyou Wang. 2026. Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 27888–27913, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion (Chen et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1285.pdf
Checklist: 2026.acl-long.1285.checklist.pdf
