PolyAudio: Advancing Multi-Audio Reasoning in Large Audio Language Models with Interleaved Multi-Audio Contexts

Sonal Kumar, Sreyan Ghosh, Yueqian Lin, S Sakshi, Ashish Seth, Yiran Chen, Ramani Duraiswami, Dinesh Manocha


Abstract
Large Audio Language Models (LALMs) have shown impressive performance on single-clip audio-language tasks such as automatic speech recognition, captioning, and sound event recognition. Yet their ability to reason over interleaved multi-audio contexts, where answering a query requires relating information across multiple audio clips, remains limited. We present PolyAudio, a LALM built on Audio Flamingo 3 that targets multi-audio understanding via instruction tuning rather than massive-scale pre-training, and PolyAudio-Instruct, a high-quality instruction-tuning dataset of 1.3M+ QA pairs spanning over 14 task subsets, built to support multi-audio understanding and reasoning. PolyAudio uses an explicit interleaved representation with clip indexing to encourage faithful grounding and reduce ambiguity in multi-clip references. We evaluate PolyAudio on a diverse suite of multi-audio benchmarks alongside standard single-audio tasks. PolyAudio achieves strong performance on multi-audio reasoning, outperforming competitive baselines, which are often limited to reasoning over at most two audio clips, while preserving robust single-clip performance. Overall, our results suggest that precise, academic-scale multi-audio instruction tuning can unlock advanced cross-clip reasoning capabilities, enabling more capable audio-centric assistants.
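As a rough illustration of what an interleaved representation with clip indexing could look like, the Python sketch below builds a text prompt in which each clip gets a numbered label and placeholder before the question. The abstract does not specify the actual prompt or token format, so the `Audio k:` labels, the `<audio_k>` placeholders, and the function name `build_interleaved_prompt` are all hypothetical and are not taken from PolyAudio or Audio Flamingo 3.

```python
from typing import Dict, List, Tuple


def build_interleaved_prompt(
    clip_paths: List[str], question: str
) -> Tuple[str, Dict[str, str]]:
    """Build an indexed, interleaved prompt for several audio clips.

    Hypothetical format: one "Audio k: <audio_k>" line per clip, followed
    by the question. Returns the prompt and a map from each placeholder
    to the clip path it stands for.
    """
    lines = []
    placeholder_to_path = {}
    for index, path in enumerate(clip_paths, start=1):
        placeholder = f"<audio_{index}>"
        # The explicit index lets the question refer to "Audio 2" unambiguously.
        lines.append(f"Audio {index}: {placeholder}")
        placeholder_to_path[placeholder] = path
    lines.append(f"Question: {question}")
    return "\n".join(lines), placeholder_to_path


prompt, mapping = build_interleaved_prompt(
    ["dog.wav", "rain.wav", "speech.wav"],
    "Which clip contains human speech?",
)
print(prompt)
```

In a real system each placeholder would be replaced by the audio encoder's features for that clip; here the mapping only records which file each placeholder refers to.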
Anthology ID:
2026.findings-acl.2101
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
42335–42353
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.2101/
Cite (ACL):
Sonal Kumar, Sreyan Ghosh, Yueqian Lin, S Sakshi, Ashish Seth, Yiran Chen, Ramani Duraiswami, and Dinesh Manocha. 2026. PolyAudio: Advancing Multi-Audio Reasoning in Large Audio Language Models with Interleaved Multi-Audio Contexts. In Findings of the Association for Computational Linguistics: ACL 2026, pages 42335–42353, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
PolyAudio: Advancing Multi-Audio Reasoning in Large Audio Language Models with Interleaved Multi-Audio Contexts (Kumar et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.2101.pdf
Checklist:
 2026.findings-acl.2101.checklist.pdf