FIGMA: Towards FIne-Grained Music retrievAl

Nishit Anand; Ashish Seth; Sreyan Ghosh; Dinesh Manocha; Ramani Duraiswami

Abstract

Retrieving music using natural language descriptions has improved with contrastive audio–text models such as CLAP, but current systems remain limited to coarse semantic queries. When descriptions specify fine-grained musical attributes such as tempo, key, chord progression, or rhythmic structure, existing models often fail to retrieve the correct audio. We show that this limitation stems from the contrastive learning objective itself: despite being trained on long captions, CLAP-based models effectively use only the first few tokens, discarding much of the information encoded in detailed prompts. We then propose FIGMA (Fine-Grained Music Retrieval), a multi-view contrastive architecture that addresses this limitation by jointly optimizing global audio–text alignment and frame-level, token-wise alignment. This design enables FIGMA to capture both high-level semantic context and fine-grained musical attributes within a unified representation space. Moreover, we formalize the task of Fine-Grained Music Retrieval and construct the Fine-Grained Music Caption dataset (FGMCaps), a large-scale dataset of 380K music–caption pairs for training along with a 10K test set, both annotated with tempo, key, chord progression, and beat count, as well as genre and mood. Extensive experiments demonstrate that FIGMA consistently outperforms existing CLAP-based music retrieval models across multiple music retrieval benchmarks, including out-of-domain evaluations, with relative improvements of up to 73.3%.
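
The joint objective described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the mean pooling for the global view, the max-over-frames and mean-over-tokens aggregation for the fine-grained view, the symmetric InfoNCE loss, and the `weight` and `temperature` values are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(x, axis):
    """Numerically stable log-softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def symmetric_info_nce(sim, temperature=0.07):
    """Symmetric contrastive loss; matched pairs sit on the diagonal of `sim`."""
    logits = sim / temperature
    idx = np.arange(sim.shape[0])
    audio_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_audio = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (audio_to_text + text_to_audio)

def global_similarity(frames, tokens):
    """Global view (assumed): cosine similarity of mean-pooled embeddings."""
    audio = l2_normalize(frames.mean(axis=1))   # (B, D)
    text = l2_normalize(tokens.mean(axis=1))    # (B, D)
    return audio @ text.T                       # (B, B)

def fine_grained_similarity(frames, tokens):
    """Fine-grained view (assumed): each token picks its best-matching frame,
    and the per-token scores are averaged."""
    f = l2_normalize(frames)                    # (B, F, D)
    t = l2_normalize(tokens)                    # (B, T, D)
    # pairwise[i, j, n, m]: token n of caption j against frame m of clip i
    pairwise = np.einsum("imd,jnd->ijnm", f, t)
    return pairwise.max(axis=3).mean(axis=2)    # (B, B)

def multi_view_loss(frames, tokens, weight=0.5, temperature=0.07):
    """Global loss plus a weighted fine-grained loss (weight is an assumption)."""
    global_loss = symmetric_info_nce(global_similarity(frames, tokens), temperature)
    local_loss = symmetric_info_nce(fine_grained_similarity(frames, tokens), temperature)
    return global_loss + weight * local_loss
```

Here `frames` has shape `(batch, num_frames, dim)` and `tokens` has shape `(batch, num_tokens, dim)`, with row `i` of each forming a matched pair. Because the fine-grained term scores every token against every frame, information late in a long caption still contributes to the similarity, which is the motivation the abstract gives for adding it to the global term.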

Anthology ID: 2026.acl-long.2197
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 47559–47572
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.2197/
Cite (ACL): Nishit Anand, Ashish Seth, Sreyan Ghosh, Dinesh Manocha, and Ramani Duraiswami. 2026. FIGMA: Towards FIne-Grained Music retrievAl. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 47559–47572, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): FIGMA: Towards FIne-Grained Music retrievAl (Anand et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.2197.pdf
Checklist: 2026.acl-long.2197.checklist.pdf
