Annif at SemEval-2025 Task 5: Traditional XMTC augmented by LLMs

Osma Suominen, Juho Inkinen, Mona Lehtinen


Abstract
This paper presents the Annif system in SemEval-2025 Task 5 (LLMs4Subjects), which focussed on subject indexing using large language models (LLMs). The task required creating subject predictions for bibliographic records from the bilingual TIBKAT database using the GND subject vocabulary. Our approach combines traditional natural language processing and machine learning techniques implemented in the Annif toolkit with innovative LLM-based methods for translation and synthetic data generation, and merges predictions from monolingual models. The system ranked first in the all-subjects category and second in the tib-core-subjects category in the quantitative evaluation, and fourth in the qualitative evaluation. These findings demonstrate the potential of combining traditional XMTC algorithms with modern LLM techniques to improve the accuracy and efficiency of subject indexing in multilingual contexts.
Anthology ID:
2025.semeval-1.315
Volume:
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Sara Rosenthal, Aiala Rosá, Debanjan Ghosh, Marcos Zampieri
Venues:
SemEval | WS
Publisher:
Association for Computational Linguistics
Pages:
2424–2431
URL:
https://preview.aclanthology.org/corrections-2025-08/2025.semeval-1.315/
Cite (ACL):
Osma Suominen, Juho Inkinen, and Mona Lehtinen. 2025. Annif at SemEval-2025 Task 5: Traditional XMTC augmented by LLMs. In Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025), pages 2424–2431, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Annif at SemEval-2025 Task 5: Traditional XMTC augmented by LLMs (Suominen et al., SemEval 2025)
PDF:
https://preview.aclanthology.org/corrections-2025-08/2025.semeval-1.315.pdf