Craig Thomson


2024

Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024
Simone Balloccu | Anya Belz | Rudali Huidrom | Ehud Reiter | Joao Sedoc | Craig Thomson
Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024

The 2024 ReproNLP Shared Task on Reproducibility of Evaluations in NLP: Overview and Results
Anya Belz | Craig Thomson
Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024

This paper presents an overview of, and the results from, the 2024 Shared Task on Reproducibility of Evaluations in NLP (ReproNLP’24), following on from three previous shared tasks on reproducibility of evaluations in NLP, ReproNLP’23, ReproGen’22 and ReproGen’21. This shared task series forms part of an ongoing research programme designed to develop theory and practice of reproducibility assessment in NLP and machine learning, against a backdrop of increasing recognition of the importance of reproducibility across the two fields. We describe the ReproNLP’24 shared task, summarise results from the reproduction studies submitted, and provide additional comparative analysis of their results.

2023

Proceedings of the 3rd Workshop on Human Evaluation of NLP Systems
Anya Belz | Maja Popović | Ehud Reiter | Craig Thomson | João Sedoc
Proceedings of the 3rd Workshop on Human Evaluation of NLP Systems

The 2023 ReproNLP Shared Task on Reproducibility of Evaluations in NLP: Overview and Results
Anya Belz | Craig Thomson
Proceedings of the 3rd Workshop on Human Evaluation of NLP Systems

This paper presents an overview of, and the results from, the 2023 Shared Task on Reproducibility of Evaluations in NLP (ReproNLP’23), following on from two previous shared tasks on reproducibility of evaluations in NLG, ReproGen’21 and ReproGen’22. This shared task series forms part of an ongoing research programme designed to develop theory and practice of reproducibility assessment in NLP and machine learning, all against a background of an interest in reproducibility that continues to grow in the two fields. This paper describes the ReproNLP’23 shared task, summarises results from the reproduction studies submitted, and provides comparative analysis of the results.

Non-Repeatable Experiments and Non-Reproducible Results: The Reproducibility Crisis in Human Evaluation in NLP
Anya Belz | Craig Thomson | Ehud Reiter | Simon Mille
Findings of the Association for Computational Linguistics: ACL 2023

Human evaluation is widely regarded as the litmus test of quality in NLP. A basic requirement of all evaluations, but in particular where they are used for meta-evaluation, is that they should support the same conclusions if repeated. However, the reproducibility of human evaluations is virtually never queried, let alone formally tested, in NLP which means that their repeatability and the reproducibility of their results is currently an open question. This focused contribution reports our review of human evaluation experiments reported in NLP papers over the past five years which we assessed in terms of their ability to be rerun. Overall, we estimate that just 5% of human evaluations are repeatable in the sense that (i) there are no prohibitive barriers to repetition, and (ii) sufficient information about experimental design is publicly available for rerunning them. Our estimate goes up to about 20% when author help is sought. We complement this investigation with a survey of results concerning the reproducibility of human evaluations where those are repeatable in the first place. Here we find worryingly low degrees of reproducibility, both in terms of similarity of scores and of findings supported by them. We summarise what insights can be gleaned so far regarding how to make human evaluations in NLP more repeatable and more reproducible.

Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP
Anya Belz | Craig Thomson | Ehud Reiter | Gavin Abercrombie | Jose M. Alonso-Moral | Mohammad Arvan | Anouck Braggaar | Mark Cieliebak | Elizabeth Clark | Kees van Deemter | Tanvi Dinkar | Ondřej Dušek | Steffen Eger | Qixiang Fang | Mingqi Gao | Albert Gatt | Dimitra Gkatzia | Javier González-Corbelle | Dirk Hovy | Manuela Hürlimann | Takumi Ito | John D. Kelleher | Filip Klubicka | Emiel Krahmer | Huiyuan Lai | Chris van der Lee | Yiru Li | Saad Mahamood | Margot Mieskes | Emiel van Miltenburg | Pablo Mosteiro | Malvina Nissim | Natalie Parde | Ondřej Plátek | Verena Rieser | Jie Ruan | Joel Tetreault | Antonio Toral | Xiaojun Wan | Leo Wanner | Lewis Watson | Diyi Yang
Proceedings of the Fourth Workshop on Insights from Negative Results in NLP

We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible. We present our results and findings, which include that just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction, and that all but one of the experiments we selected for reproduction was discovered to have flaws that made the meaningfulness of conducting a reproduction questionable. As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach. Our overall (negative) finding that the great majority of human evaluations in NLP is not repeatable and/or not reproducible and/or too flawed to justify reproduction, paints a dire picture, but presents an opportunity for a rethink about how to design and report human evaluations in NLP.

Enhancing factualness and controllability of Data-to-Text Generation via data Views and constraints
Craig Thomson | Clement Rebuffel | Ehud Reiter | Laure Soulier | Somayajulu Sripada | Patrick Gallinari
Proceedings of the 16th International Natural Language Generation Conference

Neural data-to-text systems lack the control and factual accuracy required to generate useful and insightful summaries of multidimensional data. We propose a solution in the form of data views, where each view describes an entity and its attributes along specific dimensions. A sequence of views can then be used as a high-level schema for document planning, with the neural model handling the complexities of micro-planning and surface realization. We show that our view-based system retains factual accuracy while offering high-level control of output that can be tailored based on user preference or other norms within the domain.

2022

The Accuracy Evaluation Shared Task as a Retrospective Reproduction Study
Craig Thomson | Ehud Reiter
Proceedings of the 15th International Conference on Natural Language Generation: Generation Challenges

We investigate the data collected for the Accuracy Evaluation Shared Task as a retrospective reproduction study. The shared task was based upon errors found by human annotation of computer generated summaries of basketball games. Annotation was performed in three separate stages, with texts taken from the same three systems and checked for errors by the same three annotators. We show that the mean count of errors was consistent at the highest level for each experiment, with increased variance when looking at per-system and/or per-error-type breakdowns.

GEMv2: Multilingual NLG Benchmarking in a Single Line of Code
Sebastian Gehrmann | Abhik Bhattacharjee | Abinaya Mahendiran | Alex Wang | Alexandros Papangelis | Aman Madaan | Angelina Mcmillan-major | Anna Shvets | Ashish Upadhyay | Bernd Bohnet | Bingsheng Yao | Bryan Wilie | Chandra Bhagavatula | Chaobin You | Craig Thomson | Cristina Garbacea | Dakuo Wang | Daniel Deutsch | Deyi Xiong | Di Jin | Dimitra Gkatzia | Dragomir Radev | Elizabeth Clark | Esin Durmus | Faisal Ladhak | Filip Ginter | Genta Indra Winata | Hendrik Strobelt | Hiroaki Hayashi | Jekaterina Novikova | Jenna Kanerva | Jenny Chim | Jiawei Zhou | Jordan Clive | Joshua Maynez | João Sedoc | Juraj Juraska | Kaustubh Dhole | Khyathi Raghavi Chandu | Laura Perez Beltrachini | Leonardo F. R. Ribeiro | Lewis Tunstall | Li Zhang | Mahim Pushkarna | Mathias Creutz | Michael White | Mihir Sanjay Kale | Moussa Kamal Eddine | Nico Daheim | Nishant Subramani | Ondrej Dusek | Paul Pu Liang | Pawan Sasanka Ammanamanchi | Qi Zhu | Ratish Puduppully | Reno Kriz | Rifat Shahriyar | Ronald Cardenas | Saad Mahamood | Salomey Osei | Samuel Cahyawijaya | Sanja Štajner | Sebastien Montella | Shailza Jolly | Simon Mille | Tahmid Hasan | Tianhao Shen | Tosin Adewumi | Vikas Raunak | Vipul Raheja | Vitaly Nikolaev | Vivian Tsai | Yacine Jernite | Ying Xu | Yisi Sang | Yixin Liu | Yufang Hou
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Evaluations in machine learning rarely use the latest metrics, datasets, or human evaluation in favor of remaining compatible with prior work. The compatibility, often facilitated through leaderboards, thus leads to outdated but standardized evaluation practices. We pose that the standardization is taking place in the wrong spot. Evaluation infrastructure should enable researchers to use the latest methods and what should be standardized instead is how to incorporate these new evaluation advances. We introduce GEMv2, the new version of the Generation, Evaluation, and Metrics Benchmark which uses a modular infrastructure for dataset, model, and metric developers to benefit from each other’s work. GEMv2 supports 40 documented datasets in 51 languages, ongoing online evaluation for all datasets, and our interactive tools make it easier to add new datasets to the living benchmark.

2021

Underreporting of errors in NLG output, and what to do about it
Emiel van Miltenburg | Miruna Clinciu | Ondřej Dušek | Dimitra Gkatzia | Stephanie Inglis | Leo Leppänen | Saad Mahamood | Emma Manning | Stephanie Schoch | Craig Thomson | Luou Wen
Proceedings of the 14th International Conference on Natural Language Generation

We observe a severe under-reporting of the different kinds of errors that Natural Language Generation systems make. This is a problem, because mistakes are an important indicator of where systems should still be improved. If authors only report overall performance metrics, the research community is left in the dark about the specific weaknesses that are exhibited by ‘state-of-the-art’ research. Next to quantifying the extent of error under-reporting, this position paper provides recommendations for error identification, analysis and reporting.

Generation Challenges: Results of the Accuracy Evaluation Shared Task
Craig Thomson | Ehud Reiter
Proceedings of the 14th International Conference on Natural Language Generation

The Shared Task on Evaluating Accuracy focused on techniques (both manual and automatic) for evaluating the factual accuracy of texts produced by neural NLG systems, in a sports-reporting domain. Four teams submitted evaluation techniques for this task, using very different approaches and techniques. The best-performing submissions did encouragingly well at this difficult task. However, all automatic submissions struggled to detect factual errors which are semantically or pragmatically complex (for example, based on incorrect computation or inference).

2020

Studying the Impact of Filling Information Gaps on the Output Quality of Neural Data-to-Text
Craig Thomson | Zhijie Zhao | Somayajulu Sripada
Proceedings of the 13th International Conference on Natural Language Generation

It is unfair to expect neural data-to-text to produce high quality output when there are gaps between system input data and information contained in the training text. Thomson et al. (2020) identify and narrow information gaps in Rotowire, a popular data-to-text dataset. In this paper, we describe a study which finds that a state-of-the-art neural data-to-text system produces higher quality output, according to the information extraction (IE) based metrics, when additional input data is carefully selected from this newly available source. It remains to be shown, however, whether IE metrics used in this study correlate well with humans in judging text quality.

A Gold Standard Methodology for Evaluating Accuracy in Data-To-Text Systems
Craig Thomson | Ehud Reiter
Proceedings of the 13th International Conference on Natural Language Generation

Most Natural Language Generation systems need to produce accurate texts. We propose a methodology for high-quality human evaluation of the accuracy of generated texts, which is intended to serve as a gold-standard for accuracy evaluations of data-to-text systems. We use our methodology to evaluate the accuracy of computer generated basketball summaries. We then show how our gold standard evaluation can be used to validate automated metrics.

Shared Task on Evaluating Accuracy
Ehud Reiter | Craig Thomson
Proceedings of the 13th International Conference on Natural Language Generation

We propose a shared task on methodologies and algorithms for evaluating the accuracy of generated texts, specifically summaries of basketball games produced from basketball box score and other game data. We welcome submissions based on protocols for human evaluation, automatic metrics, as well as combinations of human evaluations and metrics.

SportSett:Basketball - A robust and maintainable data-set for Natural Language Generation
Craig Thomson | Ehud Reiter | Somayajulu Sripada
Proceedings of the Workshop on Intelligent Information Processing and Natural Language Generation

2018

Comprehension Driven Document Planning in Natural Language Generation Systems
Craig Thomson | Ehud Reiter | Somayajulu Sripada
Proceedings of the 11th International Conference on Natural Language Generation

This paper proposes an approach to NLG system design which focuses on generating output text which can be more easily processed by the reader. Ways in which cognitive theory might be combined with existing NLG techniques are discussed and two simple experiments in content ordering are presented.