Benchmarking the Benchmarks: Reproducing Climate-Related NLP Tasks

Tom Calamai, Oana Balalau, Fabian M. Suchanek

Abstract
Significant efforts have been made in the NLP community to facilitate the automatic analysis of climate-related corpora through tasks such as climate-related topic detection, climate risk classification, question answering over climate topics, and many more. In this work, we perform a reproducibility study on 8 tasks and 29 datasets, testing 6 models. We find that many tasks rely heavily on surface-level keyword patterns rather than on deeper semantic or contextual understanding. Moreover, we find that 96% of the datasets contain annotation issues: among the sampled incorrect predictions of a zero-shot classifier, 16.6% are in fact clear annotation mistakes and 38.8% are ambiguous examples. These results call into question the reliability of current benchmarks for meaningfully comparing models and highlight the need for improved annotation practices. We conclude by outlining actionable recommendations to enhance dataset quality and evaluation robustness.
Anthology ID:
2025.findings-acl.925
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venues:
Findings | WS
Publisher:
Association for Computational Linguistics
Pages:
17967–18009
URL:
https://preview.aclanthology.org/acl25-workshop-ingestion/2025.findings-acl.925/
Cite (ACL):
Tom Calamai, Oana Balalau, and Fabian M. Suchanek. 2025. Benchmarking the Benchmarks: Reproducing Climate-Related NLP Tasks. In Findings of the Association for Computational Linguistics: ACL 2025, pages 17967–18009, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Benchmarking the Benchmarks: Reproducing Climate-Related NLP Tasks (Calamai et al., Findings 2025)
PDF:
https://preview.aclanthology.org/acl25-workshop-ingestion/2025.findings-acl.925.pdf