Graph-Score: A Graph-grounded Metric for Audio Captioning

Manh Luong, Gholamreza Haffari, Dinh Phung, Lizhen Qu


Abstract
Evaluating audio captioning systems is challenging because the evaluation must account for several kinds of semantic alignment between candidate and reference captions, such as sound event matching and the temporal relationships among sound events. Existing metrics fail to take these alignments into account, as they consider either statistical overlap (BLEU, SPICE, CIDEr) or latent representation similarity (FENSE). To address these limitations, we propose Graph-Score, a metric that grounds audio captions to semantic graphs to better measure the performance of automated audio captioning (AAC) systems. Our proposed metric achieves the highest agreement with human judgment on the pairwise benchmark datasets. Furthermore, we contribute high-quality benchmark datasets to support progress in developing evaluation metrics for the audio captioning task.
Anthology ID:
2025.alta-main.13
Volume:
Proceedings of The 23rd Annual Workshop of the Australasian Language Technology Association
Month:
November
Year:
2025
Address:
Sydney, Australia
Editors:
Jonathan K. Kummerfeld, Aditya Joshi, Mark Dras
Venue:
ALTA
Publisher:
Association for Computational Linguistics
Pages:
192–201
URL:
https://preview.aclanthology.org/ingest-alta/2025.alta-main.13/
Cite (ACL):
Manh Luong, Gholamreza Haffari, Dinh Phung, and Lizhen Qu. 2025. Graph-Score: A Graph-grounded Metric for Audio Captioning. In Proceedings of The 23rd Annual Workshop of the Australasian Language Technology Association, pages 192–201, Sydney, Australia. Association for Computational Linguistics.
Cite (Informal):
Graph-Score: A Graph-grounded Metric for Audio Captioning (Luong et al., ALTA 2025)
PDF:
https://preview.aclanthology.org/ingest-alta/2025.alta-main.13.pdf