CAIDAS at SemEval-2025 Task 7: Enriching Sparse Datasets with LLM-Generated Content for Improved Information Retrieval

Dominik Benchert, Severin Meßlinger, Sven Goller, Jonas Kaiser, Jan Pfister, Andreas Hotho


Abstract
The focus of SemEval-2025 Task 7 is the retrieval of relevant fact-checks for social media posts across multiple languages. We approach this task with an enhanced bi-encoder retrieval setup, which is designed to match social media posts with relevant fact-checks and is trained with additional synthetic data from LLMs. We explored and analyzed two main approaches for generating synthetic posts: one based on existing fact-checks and one based on existing posts. Our approach achieved an S@10 score of 89.53% for the monolingual task and 74.48% for the crosslingual task, ranking 16th out of 28 and 13th out of 29, respectively. Without data augmentation, the scores would have been 88.69% (17th) and 72.93% (15th).
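The S@10 figures above can be read as success-at-10: the share of posts for which at least one relevant fact-check appears among the top 10 retrieved candidates. The following is a minimal sketch of that computation under this reading, not the task's official scorer; the function name `success_at_k`, the dictionary layout, and the toy IDs are assumptions made for illustration.

```python
from typing import Dict, List, Set


def success_at_k(
    ranked: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int = 10,
) -> float:
    """Fraction of posts with at least one relevant fact-check in the top k.

    ranked:   post ID -> fact-check IDs ordered from best to worst match
    relevant: post ID -> set of fact-check IDs labelled as relevant
    (Both layouts are hypothetical, chosen only for this sketch.)
    """
    if not ranked:
        return 0.0
    hits = 0
    for post_id, candidates in ranked.items():
        gold = relevant.get(post_id, set())
        # A post counts as a success if any gold fact-check is in the top k.
        if gold.intersection(candidates[:k]):
            hits += 1
    return hits / len(ranked)


# Toy data: post_a has its gold fact-check at rank 2, post_b has none retrieved.
ranked = {
    "post_a": ["fc_3", "fc_1", "fc_7"],
    "post_b": ["fc_2", "fc_9", "fc_4"],
}
relevant = {"post_a": {"fc_1"}, "post_b": {"fc_5"}}
score = success_at_k(ranked, relevant, k=2)
```

With the toy data, one of the two posts has a relevant fact-check within the cutoff, so `score` is 0.5. In a bi-encoder setup the ranked lists would typically come from sorting fact-checks by embedding similarity to each post.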
Anthology ID:
2025.semeval-1.214
Volume:
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Sara Rosenthal, Aiala Rosá, Debanjan Ghosh, Marcos Zampieri
Venues:
SemEval | WS
Publisher:
Association for Computational Linguistics
Pages:
1623–1638
URL:
https://preview.aclanthology.org/corrections-2025-08/2025.semeval-1.214/
Cite (ACL):
Dominik Benchert, Severin Meßlinger, Sven Goller, Jonas Kaiser, Jan Pfister, and Andreas Hotho. 2025. CAIDAS at SemEval-2025 Task 7: Enriching Sparse Datasets with LLM-Generated Content for Improved Information Retrieval. In Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025), pages 1623–1638, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
CAIDAS at SemEval-2025 Task 7: Enriching Sparse Datasets with LLM-Generated Content for Improved Information Retrieval (Benchert et al., SemEval 2025)
PDF:
https://preview.aclanthology.org/corrections-2025-08/2025.semeval-1.214.pdf