Diversity is the Key: Enhancing LLM-based Post-processing for Automated Audio Captioning
Seyed Ali Farokh, Mohammad Mehdi Homayounpour, Ahmad Nickabadi
Abstract
Automated Audio Captioning (AAC) is a multimodal task aimed at generating natural language descriptions of audio content. Previous studies have shown that LLMs can improve AAC performance by summarizing audio events based on a list of candidate captions, which are selected by an external reranker from those generated using Nucleus Sampling. However, the reranking process often selects overly similar captions, disregarding the original diversity of the sampled captions. In this work, we show that this diversity reflects the AAC model’s level of certainty and propose a lightweight candidate selection approach that preserves the initial diversity of the generated captions. This, in turn, enables an LLM to summarize the captions while considering the AAC model’s certainty in a few-shot setting. Experimental results demonstrate that our method outperforms previous post-processing techniques while being significantly faster.
- Anthology ID:
- 2025.rocling-main.10
- Volume:
- Proceedings of the 37th Conference on Computational Linguistics and Speech Processing (ROCLING 2025)
- Month:
- November
- Year:
- 2025
- Address:
- National Taiwan University, Taipei City, Taiwan
- Editors:
- Kai-Wei Chang, Ke-Han Lu, Chih-Kai Yang, Zhi-Rui Tam, Wen-Yu Chang, Chung-Che Wang
- Venue:
- ROCLING
- Publisher:
- Association for Computational Linguistics
- Pages:
- 87–94
- URL:
- https://preview.aclanthology.org/dashboard/2025.rocling-main.10/
- Cite (ACL):
- Seyed Ali Farokh, Mohammad Mehdi Homayounpour, and Ahmad Nickabadi. 2025. Diversity is the Key: Enhancing LLM-based Post-processing for Automated Audio Captioning. In Proceedings of the 37th Conference on Computational Linguistics and Speech Processing (ROCLING 2025), pages 87–94, National Taiwan University, Taipei City, Taiwan. Association for Computational Linguistics.
- Cite (Informal):
- Diversity is the Key: Enhancing LLM-based Post-processing for Automated Audio Captioning (Farokh et al., ROCLING 2025)
- PDF:
- https://preview.aclanthology.org/dashboard/2025.rocling-main.10.pdf
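
The abstract's central claim, that diversity among sampled captions reflects the AAC model's certainty, can be illustrated with a small sketch. This page does not describe the paper's actual selection procedure, so everything below is an assumption made for illustration: the function names are hypothetical, token-level Jaccard distance stands in for whatever diversity measure the authors use, and evenly spaced subsampling stands in for their lightweight candidate selection.

```python
from itertools import combinations


def jaccard_distance(a: str, b: str) -> float:
    """Token-level Jaccard distance between two captions (0 = same word set)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    return 1.0 - len(tokens_a & tokens_b) / len(union)


def mean_pairwise_distance(captions: list[str]) -> float:
    """Average distance over all caption pairs; an assumed proxy for uncertainty.

    A low value means the sampled captions agree with each other,
    a high value means they disagree.
    """
    pairs = list(combinations(captions, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


def select_evenly(captions: list[str], k: int) -> list[str]:
    """Keep k captions at evenly spaced positions in sampling order.

    Unlike a top-k reranker, this does not favor mutually similar captions,
    so the subset roughly mirrors the spread of the full sample.
    """
    if k <= 0:
        return []
    if k >= len(captions):
        return list(captions)
    step = len(captions) / k
    return [captions[int(i * step)] for i in range(k)]


if __name__ == "__main__":
    # Hypothetical captions, as if drawn with nucleus sampling for one clip.
    sampled = [
        "a dog barks while cars pass by",
        "a dog is barking near a road",
        "birds chirp and a dog barks",
        "traffic noise with a dog barking",
    ]
    print(round(mean_pairwise_distance(sampled), 3))
    print(select_evenly(sampled, 2))
```

In a pipeline of the kind the abstract outlines, the selected subset would then be placed in a few-shot prompt for the LLM to summarize; that step is omitted here because the prompt format is not given on this page.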