CAIDAS at SemEval-2025 Task 7: Enriching Sparse Datasets with LLM-Generated Content for Improved Information Retrieval

Dominik Benchert; Severin Meßlinger; Sven Goller; Jonas Kaiser; Jan Pfister; Andreas Hotho

Abstract

The focus of SemEval-2025 Task 7 is the retrieval of relevant fact-checks for social media posts across multiple languages. We approach this task with a bi-encoder retrieval setup that matches social media posts with relevant fact-checks and is enhanced with synthetic data generated by LLMs. We explored and analyzed two main approaches for generating synthetic posts: one based on existing fact-checks and one based on existing posts. Our approach achieved an S@10 score of 89.53% for the monolingual task and 74.48% for the crosslingual task, ranking 16th out of 28 and 13th out of 29, respectively. Without data augmentation, the scores would have been 88.69% (17th) and 72.93% (15th).
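As a rough illustration of the S@10 metric reported above, the sketch below assumes S@10 means success-at-10: the share of posts for which at least one relevant fact-check appears among the top 10 retrieved candidates. The function name `success_at_k`, the dictionary layout, and the toy IDs are hypothetical and not taken from the paper or the task's scoring script.

```python
from typing import Dict, List, Set


def success_at_k(
    ranked: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int = 10,
) -> float:
    """Fraction of posts with at least one relevant fact-check in the top k.

    ranked:   post ID -> fact-check IDs ordered from best to worst match
    relevant: post ID -> set of fact-check IDs judged relevant for that post
    (Both layouts are assumptions made for this sketch.)
    """
    if not ranked:
        return 0.0
    hits = sum(
        1
        for post_id, fact_check_ids in ranked.items()
        if relevant.get(post_id, set()) & set(fact_check_ids[:k])
    )
    return hits / len(ranked)


# Toy data: post_a has its relevant fact-check at rank 2, post_b has none retrieved.
ranked = {"post_a": ["fc3", "fc1"], "post_b": ["fc7", "fc9"]}
relevant = {"post_a": {"fc1"}, "post_b": {"fc2"}}
score = success_at_k(ranked, relevant, k=10)
```

Under this definition a post counts as a success as soon as any one of its relevant fact-checks is retrieved, so the metric does not reward ranking that fact-check higher within the top 10.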

Anthology ID: 2025.semeval-1.214
Volume: Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Sara Rosenthal, Aiala Rosá, Debanjan Ghosh, Marcos Zampieri
Venues: SemEval | WS
Publisher: Association for Computational Linguistics
Pages: 1623–1638
URL: https://preview.aclanthology.org/corrections-2025-08/2025.semeval-1.214/
Cite (ACL): Dominik Benchert, Severin Meßlinger, Sven Goller, Jonas Kaiser, Jan Pfister, and Andreas Hotho. 2025. CAIDAS at SemEval-2025 Task 7: Enriching Sparse Datasets with LLM-Generated Content for Improved Information Retrieval. In Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025), pages 1623–1638, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): CAIDAS at SemEval-2025 Task 7: Enriching Sparse Datasets with LLM-Generated Content for Improved Information Retrieval (Benchert et al., SemEval 2025)
PDF: https://preview.aclanthology.org/corrections-2025-08/2025.semeval-1.214.pdf
