2025
The 2025 ReproNLP Shared Task on Reproducibility of Evaluations in NLP: Overview and Results
Anya Belz, Craig Thomson, Javier González Corbelle, Malo Ruelle
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)
This paper presents an overview of, and the results from, the 2025 Shared Task on Reproducibility of Evaluations in NLP (ReproNLP’25), which followed on from four previous shared tasks on reproducibility of evaluations: ReproNLP’24, ReproNLP’23, ReproGen’22 and ReproGen’21. This shared task series forms part of an ongoing research programme designed to develop the theory and practice of reproducibility assessment in NLP and machine learning, against a backdrop of increasing recognition of the topic’s importance across the two fields. We describe the ReproNLP’25 shared task, summarise the results from the submitted reproduction studies, and provide additional comparative analysis of their results, including, for the first time, ‘sanity-check’ evaluations by LLMs.