PolyAudio: Advancing Multi-Audio Reasoning in Large Audio Language Models with Interleaved Multi-Audio Contexts
Sonal Kumar, Sreyan Ghosh, Yueqian Lin, S Sakshi, Ashish Seth, Yiran Chen, Ramani Duraiswami, Dinesh Manocha
Abstract
Large Audio Language Models (LALMs) have shown impressive performance on single-clip audio language tasks such as automatic speech recognition, captioning, and sound event recognition. Yet their ability to reason over interleaved multi-audio contexts, where answering a query requires relating information across multiple audio clips, remains limited. We present PolyAudio, a LALM built on Audio Flamingo 3 that targets multi-audio understanding via instruction tuning rather than massive-scale pre-training, and PolyAudio-Instruct, a high-quality instruction-tuning dataset of 1.3M+ QA pairs spanning over 14 task subsets, designed to support multi-audio understanding and reasoning. PolyAudio uses an explicit interleaved representation with clip indexing to encourage faithful grounding and reduce ambiguity in multi-clip references. We evaluate PolyAudio on a diverse suite of multi-audio benchmarks alongside standard single-audio tasks. PolyAudio achieves strong performance on multi-audio reasoning, outperforming competitive baselines, which are often limited to reasoning over at most two audio clips, while preserving robust single-clip performance. Overall, our results suggest that precise, academic-scale multi-audio instruction tuning can unlock advanced cross-clip reasoning capabilities, enabling more capable audio-centric assistants.
- Anthology ID:
- 2026.findings-acl.2101
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 42335–42353
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.2101/
- Cite (ACL):
- Sonal Kumar, Sreyan Ghosh, Yueqian Lin, S Sakshi, Ashish Seth, Yiran Chen, Ramani Duraiswami, and Dinesh Manocha. 2026. PolyAudio: Advancing Multi-Audio Reasoning in Large Audio Language Models with Interleaved Multi-Audio Contexts. In Findings of the Association for Computational Linguistics: ACL 2026, pages 42335–42353, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- PolyAudio: Advancing Multi-Audio Reasoning in Large Audio Language Models with Interleaved Multi-Audio Contexts (Kumar et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.2101.pdf