Anja Belz

Also published as: Anya Belz

2021

pdf bib abs
A Systematic Review of Reproducibility Research in Natural Language Processing
Anya Belz | Shubham Agarwal | Anastasia Shimorina | Ehud Reiter
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Against the background of what has been termed a reproducibility crisis in science, the NLP field is becoming increasingly interested in, and conscientious about, the reproducibility of its results. The past few years have seen an impressive range of new initiatives, events and active research in the area. However, the field is far from reaching a consensus about how reproducibility should be defined, measured and addressed, with diversity of views currently increasing rather than converging. With this focused contribution, we aim to provide a wide-angle, and as near as possible complete, snapshot of current work on reproducibility in NLP,

pdf bib
Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval)
Anya Belz | Shubham Agarwal | Yvette Graham | Ehud Reiter | Anastasia Shimorina
Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval)

pdf bib
Proceedings of the 14th International Conference on Natural Language Generation
Anya Belz | Angela Fan | Ehud Reiter | Yaji Sripada
Proceedings of the 14th International Conference on Natural Language Generation

pdf bib abs
The ReproGen Shared Task on Reproducibility of Human Evaluations in NLG: Overview and Results
Anya Belz | Anastasia Shimorina | Shubham Agarwal | Ehud Reiter
Proceedings of the 14th International Conference on Natural Language Generation

The NLP field has recently seen a substantial increase in work related to reproducibility of results, and more generally in recognition of the importance of having shared definitions and practices relating to evaluation. Much of the work on reproducibility has so far focused on metric scores, with reproducibility of human evaluation results receiving far less attention. As part of a research programme designed to develop theory and practice of reproducibility assessment in NLP, we organised the first shared task on reproducibility of human evaluations, ReproGen 2021. This paper describes the shared task in detail, summarises results from each of the reproduction studies submitted, and provides further comparative analysis of the results. Out of nine initial team registrations, we received submissions from four teams. Meta-analysis of the four reproduction studies revealed varying degrees of reproducibility, and allowed very tentative first conclusions about what types of evaluation tend to have better reproducibility.

pdf bib abs
Another PASS: A Reproduction Study of the Human Evaluation of a Football Report Generation System
Simon Mille | Thiago Castro Ferreira | Anya Belz | Brian Davis
Proceedings of the 14th International Conference on Natural Language Generation

This paper reports results from a reproduction study in which we repeated the human evaluation of the PASS Dutch-language football report generation system (van der Lee et al., 2017). The work was carried out as part of the ReproGen Shared Task on Reproducibility of Human Evaluations in NLG, in Track A (Paper 1). We aimed to repeat the original study exactly, with the main difference that a different set of evaluators was used. We describe the study design, present the results from the original and the reproduction study, and then compare and analyse the differences between the two sets of results. For the two ‘headline’ results of average Fluency and Clarity, we find that in both studies, the system was rated more highly for Clarity than for Fluency, and Clarity had higher standard deviation. Clarity and Fluency ratings were higher, and their standard deviations lower, in the reproduction study than in the original study by substantial margins. Clarity had a higher degree of reproducibility than Fluency, as measured by the coefficient of variation. Data and code are publicly available.

pdf bib abs
A Reproduction Study of an Annotation-based Human Evaluation of MT Outputs
Maja Popović | Anya Belz
Proceedings of the 14th International Conference on Natural Language Generation

In this paper we report our reproduction study of the Croatian part of an annotation-based human evaluation of machine-translated user reviews (Popovic, 2020). The work was carried out as part of the ReproGen Shared Task on Reproducibility of Human Evaluation in NLG. Our aim was to repeat the original study exactly, except for using a different set of evaluators. We describe the experimental design, characterise differences between original and reproduction study, and present the results from each study, along with analysis of the similarity between them. For the six main evaluation results of Major/Minor/All Comprehension error rates and Major/Minor/All Adequacy error rates, we find that (i) 4/6 system rankings are the same in both studies, (ii) the relative differences between systems are replicated well for Major Comprehension and Adequacy (Pearson’s r > 0.9), but not for the corresponding Minor error rates (Pearson’s r of 0.36 for Adequacy, 0.67 for Comprehension), and (iii) the individual system scores for both types of Minor error rates had a higher degree of reproducibility than the corresponding Major error rates. We also examine inter-annotator agreement and compare the annotations obtained in the original and reproduction studies.

2020

pdf bib abs
The Third Multilingual Surface Realisation Shared Task (SR’20): Overview and Evaluation Results
Simon Mille | Anya Belz | Bernd Bohnet | Thiago Castro Ferreira | Yvette Graham | Leo Wanner
Proceedings of the Third Workshop on Multilingual Surface Realisation

This paper presents results from the Third Shared Task on Multilingual Surface Realisation (SR’20) which was organised as part of the COLING’20 Workshop on Multilingual Surface Realisation. As in SR’18 and SR’19, the shared task comprised two tracks: (1) a Shallow Track where the inputs were full UD structures with word order information removed and tokens lemmatised; and (2) a Deep Track where additionally, functional words and morphological information were removed. Moreover, each track had two subtracks: (a) restricted-resource, where only the data provided or approved as part of a track could be used for training models, and (b) open-resource, where any data could be used. The Shallow Track was offered in 11 languages, whereas the Deep Track was offered in 3. Systems were evaluated using both automatic metrics and direct assessment by human evaluators in terms of Readability and Meaning Similarity to reference outputs. We present the evaluation results, along with descriptions of the SR’20 tracks, data and evaluation methods, as well as brief summaries of the participating systems. For full descriptions of the participating systems, please see the separate system reports elsewhere in this volume.

Human assessment remains the most trusted form of evaluation in NLG, but highly diverse approaches and a proliferation of different quality criteria used by researchers make it difficult to compare results and draw conclusions across papers, with adverse implications for meta-evaluation and reproducibility. In this paper, we present (i) our dataset of 165 NLG papers with human evaluations, (ii) the annotation scheme we developed to label the papers for different aspects of evaluations, (iii) quantitative analyses of the annotations, and (iv) a set of recommendations for improving standards in evaluation reporting. We use the annotations as a basis for examining information included in evaluation reports, and levels of consistency in approaches, experimental design and terminology, focusing in particular on the 200+ different terms that have been used for evaluated aspects of quality. We conclude that due to a pervasive lack of clarity in reports and extreme diversity in approaches, human evaluation in NLG presents as extremely confused in 2020, and that the field is in urgent need of standard methods and terminology.

pdf bib abs
Disentangling the Properties of Human Evaluation Methods: A Classification System to Support Comparability, Meta-Evaluation and Reproducibility Testing
Anya Belz | Simon Mille | David M. Howcroft
Proceedings of the 13th International Conference on Natural Language Generation

Current standards for designing and reporting human evaluations in NLP mean it is generally unclear which evaluations are comparable and can be expected to yield similar results when applied to the same system outputs. This has serious implications for reproducibility testing and meta-evaluation, in particular given that human evaluation is considered the gold standard against which the trustworthiness of automatic metrics is gauged. Using examples from NLG, we propose a classification system for evaluations based on disentangling (i) what is being evaluated (which aspect of quality), and (ii) how it is evaluated in specific (a) evaluation modes and (b) experimental designs. We show that this approach provides a basis for determining comparability, hence for comparison of evaluations across papers, meta-evaluation experiments, and reproducibility testing.

pdf bib abs
ReproGen: Proposal for a Shared Task on Reproducibility of Human Evaluations in NLG
Anya Belz | Shubham Agarwal | Anastasia Shimorina | Ehud Reiter
Proceedings of the 13th International Conference on Natural Language Generation

Across NLP, a growing body of work is looking at the issue of reproducibility. However, replicability of human evaluation experiments and reproducibility of their results are currently under-addressed, and this is of particular concern for NLG where human evaluations are the norm. This paper outlines our ideas for a shared task on reproducibility of human evaluations in NLG which aims (i) to shed light on the extent to which past NLG evaluations are replicable and reproducible, and (ii) to draw conclusions regarding how evaluations can be designed and reported to increase replicability and reproducibility. If the task is run over several years, we hope to be able to document an overall increase in levels of replicability and reproducibility over time.

2019

pdf bib abs
Conceptualisation and Annotation of Drug Nonadherence Information for Knowledge Extraction from Patient-Generated Texts
Anja Belz | Richard Hoile | Elizabeth Ford | Azam Mullick
Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)

Approaches to knowledge extraction (KE) in the health domain often start by annotating text to indicate the knowledge to be extracted, and then use the annotated text to train systems to perform the KE. This may work for annotating named entities or other contiguous noun phrases (drugs, some drug effects), but becomes increasingly difficult when items tend to be expressed across multiple, possibly non-contiguous, syntactic constituents (e.g. most descriptions of drug effects in user-generated text). Other issues include that it is not always clear how annotations map to actionable insights, or how they scale up to, or can form part of, more complex KE tasks. This paper reports our efforts in developing an approach to extracting knowledge about drug nonadherence from health forums which led us to conclude that development cannot proceed in separate steps but that all aspects (from conceptualisation to annotation scheme development, annotation, KE system training and knowledge graph instantiation) are interdependent and need to be co-developed. Our aim in this paper is two-fold: we describe a generally applicable framework for developing a KE approach, and present a specific KE approach, developed with the framework, for the task of gathering information about antidepressant drug nonadherence. We report the conceptualisation, the annotation scheme, the annotated corpus, and an analysis of annotated texts.

pdf bib
Proceedings of the 2nd Workshop on Multilingual Surface Realisation (MSR 2019)
Simon Mille | Anja Belz | Bernd Bohnet | Yvette Graham | Leo Wanner
Proceedings of the 2nd Workshop on Multilingual Surface Realisation (MSR 2019)

pdf bib abs
The Second Multilingual Surface Realisation Shared Task (SR’19): Overview and Evaluation Results
Simon Mille | Anja Belz | Bernd Bohnet | Yvette Graham | Leo Wanner
Proceedings of the 2nd Workshop on Multilingual Surface Realisation (MSR 2019)

We report results from the SR’19 Shared Task, the second edition of a multilingual surface realisation task organised as part of the EMNLP’19 Workshop on Multilingual Surface Realisation. As in SR’18, the shared task comprised two tracks with different levels of complexity: (a) a shallow track where the inputs were full UD structures with word order information removed and tokens lemmatised; and (b) a deep track where additionally, functional words and morphological information were removed. The shallow track was offered in eleven, and the deep track in three languages. Systems were evaluated (a) automatically, using a range of intrinsic metrics, and (b) by human judges in terms of readability and meaning similarity. This report presents the evaluation results, along with descriptions of the SR’19 tracks, data and evaluation methods. For full descriptions of the participating systems, please see the separate system reports elsewhere in this volume.

2018

pdf bib
Proceedings of the First Workshop on Multilingual Surface Realisation
Simon Mille | Anja Belz | Bernd Bohnet | Emily Pitler | Leo Wanner
Proceedings of the First Workshop on Multilingual Surface Realisation

pdf bib abs
The First Multilingual Surface Realisation Shared Task (SR’18): Overview and Evaluation Results
Simon Mille | Anja Belz | Bernd Bohnet | Yvette Graham | Emily Pitler | Leo Wanner
Proceedings of the First Workshop on Multilingual Surface Realisation

We report results from the SR’18 Shared Task, a new multilingual surface realisation task organised as part of the ACL’18 Workshop on Multilingual Surface Realisation. As in its English-only predecessor task SR’11, the shared task comprised two tracks with different levels of complexity: (a) a shallow track where the inputs were full UD structures with word order information removed and tokens lemmatised; and (b) a deep track where additionally, functional words and morphological information were removed. The shallow track was offered in ten, and the deep track in three languages. Systems were evaluated (a) automatically, using a range of intrinsic metrics, and (b) by human judges in terms of readability and meaning similarity. This report presents the evaluation results, along with descriptions of the SR’18 tracks, data and evaluation methods. For full descriptions of the participating systems, please see the separate system reports elsewhere in this volume.

pdf bib abs
SpatialVOC2K: A Multilingual Dataset of Images with Annotations and Features for Spatial Relations between Objects
Anja Belz | Adrian Muscat | Pierre Anguill | Mouhamadou Sow | Gaétan Vincent | Yassine Zinessabah
Proceedings of the 11th International Conference on Natural Language Generation

We present SpatialVOC2K, the first multilingual image dataset with spatial relation annotations and object features for image-to-text generation, built using 2,026 images from the PASCAL VOC2008 dataset. The dataset incorporates (i) the labelled object bounding boxes from VOC2008, (ii) geometrical, language and depth features for each object, and (iii) for each pair of objects in both orders, (a) the single best preposition and (b) the set of possible prepositions in the given language that describe the spatial relationship between the two objects. Compared to previous versions of the dataset, we have roughly doubled the size for French, and completely reannotated as well as increased the size of the English portion, providing single best prepositions for English for the first time. Furthermore, we have added explicit 3D depth features for objects. We are releasing our dataset for free reuse, along with evaluation tools to enable comparative evaluation.

pdf bib abs
Adding the Third Dimension to Spatial Relation Detection in 2D Images
Brandon Birmingham | Adrian Muscat | Anja Belz
Proceedings of the 11th International Conference on Natural Language Generation

Detection of spatial relations between objects in images is currently a popular subject in image description research. A range of different language and geometric object features have been used in this context, but methods have not so far used explicit information about the third dimension (depth), except when manually added to annotations. The lack of such information hampers detection of spatial relations that are inherently 3D. In this paper, we use a fully automatic method for creating a depth map of an image and derive several different object-level depth features from it which we add to an existing feature set to test the effect on spatial relation detection. We show that performance increases are obtained from adding depth features in all scenarios tested.

pdf bib abs
Underspecified Universal Dependency Structures as Inputs for Multilingual Surface Realisation
Simon Mille | Anja Belz | Bernd Bohnet | Leo Wanner
Proceedings of the 11th International Conference on Natural Language Generation

In this paper, we present the datasets used in the Shallow and Deep Tracks of the First Multilingual Surface Realisation Shared Task (SR’18). For the Shallow Track, data in ten languages has been released: Arabic, Czech, Dutch, English, Finnish, French, Italian, Portuguese, Russian and Spanish. For the Deep Track, data in three languages is made available: English, French and Spanish. We describe in detail how the datasets were derived from the Universal Dependencies V2.0, and report on an evaluation of the Deep Track input quality. In addition, we examine the motivation for, and likely usefulness of, deriving NLG inputs from annotations in resources originally developed for Natural Language Understanding (NLU), and assess whether the resulting inputs supply enough information of the right kind for the final stage in the NLG process.

2017

pdf bib
Proceedings of the Sixth Workshop on Vision and Language
Anya Belz | Erkut Erdem | Katerina Pastra | Krystian Mikolajczyk
Proceedings of the Sixth Workshop on Vision and Language

pdf bib abs
Shared Task Proposal: Multilingual Surface Realization Using Universal Dependency Trees
Simon Mille | Bernd Bohnet | Leo Wanner | Anja Belz
Proceedings of the 10th International Conference on Natural Language Generation

We propose a shared task on multilingual Surface Realization, i.e., on mapping unordered and uninflected universal dependency trees to correctly ordered and inflected sentences in a number of languages. A second deeper input will be available in which, in addition, functional words, fine-grained PoS and morphological information will be removed from the input trees. The first shared task on Surface Realization was carried out in 2011 with a similar setup, with a focus on English. We think that it is time for relaunching such a shared task effort in view of the arrival of Universal Dependencies annotated treebanks for a large number of languages on the one hand, and the increasing dominance of Deep Learning, which proved to be a game changer for NLP, on the other hand.

2016

pdf bib
Proceedings of the 5th Workshop on Vision and Language
Anya Belz | Erkut Erdem | Krystian Mikolajczyk | Katerina Pastra
Proceedings of the 5th Workshop on Vision and Language

pdf bib
Exploring Different Preposition Sets, Models and Feature Sets in Automatic Generation of Spatial Image Descriptions
Anja Belz | Adrian Muscat | Brandon Birmingham
Proceedings of the 5th Workshop on Vision and Language

pdf bib abs
Analysis of Twitter Data for Postmarketing Surveillance in Pharmacovigilance
Julie Pain | Jessie Levacher | Adam Quinquenel | Anja Belz
Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)

Postmarketing surveillance (PMS) has the vital aim of monitoring the effects of drugs after release for use by the general population, but suffers from under-reporting and limited coverage. Automatic methods for detecting drug effect reports, especially for social media, could vastly increase the scope of PMS. Very few automatic PMS methods are currently available, in particular for the messy text types encountered on Twitter. In this paper we describe first results for developing PMS methods specifically for tweets. We describe the corpus of 125,669 tweets we have created and annotated to train and test the tools. We find that generic tools perform well for tweet-level language identification and tweet-level sentiment analysis (both 0.94 F1-Score). For detection of effect mentions we are able to achieve 0.87 F1-Score, while effect-level adverse-vs.-beneficial analysis proves harder with an F1-Score of 0.64. Among other things, our results indicate that MetaMap semantic types provide a very promising basis for identifying drug effect mentions in tweets.

pdf bib
Effect of Data Annotation, Feature Selection and Model Choice on Spatial Description Generation in French
Anja Belz | Adrian Muscat | Brandon Birmingham | Jessie Levacher | Julie Pain | Adam Quinquenel
Proceedings of the 9th International Natural Language Generation conference

2015

pdf bib
Describing Spatial Relationships between Objects in Images in English and French
Anja Belz | Adrian Muscat | Maxime Aberton | Sami Benjelloun
Proceedings of the Fourth Workshop on Vision and Language

pdf bib
Proceedings of the 15th European Workshop on Natural Language Generation (ENLG)
Anya Belz | Albert Gatt | François Portet | Matthew Purver
Proceedings of the 15th European Workshop on Natural Language Generation (ENLG)

pdf bib
Generating Descriptions of Spatial Relations between Objects in Images
Adrian Muscat | Anja Belz
Proceedings of the 15th European Workshop on Natural Language Generation (ENLG)

2014

pdf bib abs
A Comparative Evaluation Methodology for NLG in Interactive Systems
Helen Hastie | Anja Belz
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Interactive systems have become an increasingly important type of application for deployment of NLG technology over recent years. At present, we do not yet have commonly agreed terminology or methodology for evaluating NLG within interactive systems. In this paper, we take steps towards addressing this gap by presenting a set of principles for designing new evaluations in our comparative evaluation methodology. We start with presenting a categorisation framework, giving an overview of different categories of evaluation measures, in order to provide standard terminology for categorising existing and new evaluation techniques. Background on existing evaluation methodologies for NLG and interactive systems is presented. The comparative evaluation methodology is presented. Finally, a methodology for comparative evaluation of NLG components embedded within interactive systems is presented in terms of the comparative evaluation methodology, using a specific task for illustrative purposes.

pdf bib
The Last 10 Metres: Using Visual Analysis and Verbal Communication in Guiding Visually Impaired Smartphone Users to Entrances
Anja Belz | Anil Bharath
Proceedings of the Third Workshop on Vision and Language

2012

pdf bib
The Surface Realisation Task: Recent Developments and Future Plans
Anja Belz | Bernd Bohnet | Simon Mille | Leo Wanner | Michael White
INLG 2012 Proceedings of the Seventh International Natural Language Generation Conference

pdf bib abs
A Repository of Data and Evaluation Resources for Natural Language Generation
Anja Belz | Albert Gatt
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Starting in 2007, the field of natural language generation (NLG) has organised shared-task evaluation events every year, under the Generation Challenges umbrella. In the course of these shared tasks, a wealth of data has been created, along with associated task definitions and evaluation regimes. In other contexts too, sharable NLG data is now being created. In this paper, we describe the online repository that we have created as a one-stop resource for obtaining NLG task materials, both from Generation Challenges tasks and from other sources, where the set of materials provided for each task consists of (i) task definition, (ii) input and output data, (iii) evaluation software, (iv) documentation, and (v) publications reporting previous results.

pdf bib abs
LG-Eval: A Toolkit for Creating Online Language Evaluation Experiments
Eric Kow | Anja Belz
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper we describe the LG-Eval toolkit for creating online language evaluation experiments. LG-Eval is the direct result of our work setting up and carrying out the human evaluation experiments in several of the Generation Challenges shared tasks. It provides tools for creating experiments with different kinds of rating tools, allocating items to evaluators, and collecting the evaluation scores.

2011

pdf bib
Discrete vs. Continuous Rating Scales for Language Evaluation in NLP
Anja Belz | Eric Kow
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Unsupervised Alignment of Comparable Data and Text Resources
Anja Belz | Eric Kow
Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web

pdf bib
Proceedings of the UCNLG+Eval: Language Generation and Evaluation Workshop
Anja Belz | Roger Evans | Albert Gatt | Kristina Striegnitz
Proceedings of the UCNLG+Eval: Language Generation and Evaluation Workshop

pdf bib
Generation Challenges 2011 Preface
Anja Belz | Albert Gatt | Alexander Koller | Kristina Striegnitz
Proceedings of the 13th European Workshop on Natural Language Generation

2010

pdf bib abs
A Game-based Approach to Transcribing Images of Text
Khalil Dahab | Anja Belz
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Creating language resources is expensive and time-consuming, and this forms a bottleneck in the development of language technology, for less-studied non-European languages in particular. The recent internet phenomenon of crowd-sourcing offers a cost-effective and potentially fast way of overcoming such language resource acquisition bottlenecks. We present a methodology that takes as its input scanned documents of typed or hand-written text, and produces transcriptions of the text as its output. Instead of using Optical Character Recognition (OCR) technology, the methodology is game-based and produces such transcriptions as a by-product. The approach is intended particularly for languages for which language technology and resources are scarce and reliable OCR technology may not exist. It can be used in place of OCR for transcribing individual documents, or to create corpora of paired images and transcriptions required to train OCR tools. We present Minefield, a prototype implementation of the approach which is currently collecting Arabic transcriptions.

pdf bib
Construction of bilingual multimodal corpora of referring expressions in collaborative problem solving
Takenobu Tokunaga | Ryu Iida | Masaaki Yasuhara | Asuka Terai | David Morris | Anja Belz
Proceedings of the Eighth Workshop on Asian Language Resources

pdf bib
Comparing Rating Scales and Preference Judgements in Language Evaluation
Anja Belz | Eric Kow
Proceedings of the 6th International Natural Language Generation Conference

pdf bib
Extracting Parallel Fragments from Comparable Corpora for Data-to-text Generation
Anja Belz | Eric Kow
Proceedings of the 6th International Natural Language Generation Conference

pdf bib
Generation Challenges 2010 Preface
Anja Belz | Albert Gatt | Alexander Koller
Proceedings of the 6th International Natural Language Generation Conference

pdf bib
The GREC Challenges 2010: Overview and Evaluation Results
Anja Belz | Eric Kow
Proceedings of the 6th International Natural Language Generation Conference

pdf bib
Finding Common Ground: Towards a Surface Realisation Shared Task
Anja Belz | Mike White | Josef van Genabith | Deirdre Hogan | Amanda Stent
Proceedings of the 6th International Natural Language Generation Conference