iKnow-audio: Integrating Knowledge Graphs with Audio-Language Models

Michel Olvera, Changhong Wang, Paraskevas Stamatiadis, Gaël Richard, Slim Essid


Abstract
Contrastive Language–Audio Pretraining (CLAP) models learn by aligning audio and text in a shared embedding space, enabling powerful zero-shot recognition. However, their performance is highly sensitive to prompt formulation and language nuances, and they often inherit semantic ambiguities and spurious correlations from noisy pretraining data. While prior work has explored prompt engineering, adapters, and prefix tuning to address these limitations, the use of structured prior knowledge remains largely unexplored. We present iKnow-audio, a framework that integrates knowledge graphs with audio-language models to provide robust semantic grounding. iKnow-audio builds on the Audio-centric Knowledge Graph (AKG), which encodes ontological relations comprising semantic, causal, and taxonomic connections reflective of everyday sound scenes and events. By training knowledge graph embedding models on the AKG and refining CLAP predictions through this structured knowledge, iKnow-audio improves disambiguation of acoustically similar sounds and reduces reliance on prompt engineering. Comprehensive zero-shot evaluations across six benchmark datasets demonstrate consistent gains over baseline CLAP, supported by embedding-space analyses that highlight improved relational grounding. Resources are publicly available at https://github.com/michelolzam/iknow-audio
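To make the idea of refining zero-shot predictions with structured knowledge concrete, here is a minimal toy sketch. It is not the authors' implementation: the three-dimensional embeddings, the `neighbors` table (standing in for entities that a knowledge graph embedding model might link to each label), the mixing `weight`, and the function names are all assumptions chosen for illustration. The baseline scores each label by audio-text cosine similarity alone; the refined score also averages in the similarity between the audio and each label's graph neighbors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def baseline_scores(audio, text_embs, labels):
    """Zero-shot scoring: audio-text cosine similarity per candidate label."""
    return {label: cosine(audio, text_embs[label]) for label in labels}

def refine_scores(audio, text_embs, labels, neighbors, weight=0.5):
    """Blend each label's direct score with the mean score of its graph neighbors.

    `neighbors` and `weight` are hypothetical: a hand-written stand-in for
    related entities retrieved from a knowledge graph, and an arbitrary mix.
    """
    refined = {}
    for label in labels:
        direct = cosine(audio, text_embs[label])
        related = neighbors.get(label, [])
        if related:
            context = sum(cosine(audio, text_embs[r]) for r in related) / len(related)
            refined[label] = (1 - weight) * direct + weight * context
        else:
            refined[label] = direct
    return refined

# Toy embeddings (invented for this example, not produced by a real CLAP model).
text_embs = {
    "thunder": [1.0, 0.0, 0.0],
    "fireworks": [1.0, 0.3, 0.0],
    "rain": [0.0, 0.0, 1.0],
    "storm": [0.5, 0.0, 1.0],
    "crowd cheering": [0.0, 1.0, 0.0],
    "celebration": [0.2, 1.0, 0.0],
}
neighbors = {
    "thunder": ["rain", "storm"],
    "fireworks": ["crowd cheering", "celebration"],
}
labels = ["thunder", "fireworks"]
audio = [1.0, 0.2, 0.6]  # a clip that also carries rain-like content

baseline = baseline_scores(audio, text_embs, labels)
refined = refine_scores(audio, text_embs, labels, neighbors)

print(max(baseline, key=baseline.get))  # fireworks
print(max(refined, key=refined.get))    # thunder
```

In this toy setup the two labels are nearly tied on direct similarity, and the graph context is what separates them, which is the kind of disambiguation of acoustically similar sounds the abstract describes.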
Anthology ID:
2025.emnlp-main.1759
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
34671–34688
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1759/
Cite (ACL):
Michel Olvera, Changhong Wang, Paraskevas Stamatiadis, Gaël Richard, and Slim Essid. 2025. iKnow-audio: Integrating Knowledge Graphs with Audio-Language Models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 34671–34688, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
iKnow-audio: Integrating Knowledge Graphs with Audio-Language Models (Olvera et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1759.pdf
Checklist:
 2025.emnlp-main.1759.checklist.pdf