The 2025 ReproNLP Shared Task on Reproducibility of Evaluations in NLP: Overview and Results
Anya Belz, Craig Thomson, Javier González Corbelle, Malo Ruelle
Abstract
This paper presents an overview of, and the results from, the 2025 Shared Task on Reproducibility of Evaluations in NLP (ReproNLP’25), which followed on from four previous shared tasks on reproducibility of evaluations: ReproNLP’24, ReproNLP’23, ReproGen’22 and ReproGen’21. This shared task series forms part of an ongoing research programme designed to develop the theory and practice of reproducibility assessment in NLP and machine learning, against a backdrop of increasing recognition of the topic’s importance across the two fields. We describe the ReproNLP’25 shared task, summarise the results from the submitted reproduction studies, and provide additional comparative analysis of their results, including, for the first time, additional ‘sanity-check’ evaluations by LLMs.
- Anthology ID:
- 2025.gem-1.78
- Volume:
- Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)
- Month:
- July
- Year:
- 2025
- Address:
- Vienna, Austria and virtual meeting
- Editors:
- Kaustubh Dhole, Miruna Clinciu
- Venues:
- GEM | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 1002–1016
- URL:
- https://preview.aclanthology.org/transition-to-people-yaml/2025.gem-1.78/
- Cite (ACL):
- Anya Belz, Craig Thomson, Javier González Corbelle, and Malo Ruelle. 2025. The 2025 ReproNLP Shared Task on Reproducibility of Evaluations in NLP: Overview and Results. In Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²), pages 1002–1016, Vienna, Austria and virtual meeting. Association for Computational Linguistics.
- Cite (Informal):
- The 2025 ReproNLP Shared Task on Reproducibility of Evaluations in NLP: Overview and Results (Belz et al., GEM 2025)
- PDF:
- https://preview.aclanthology.org/transition-to-people-yaml/2025.gem-1.78.pdf