2024
Intrinsic Task-based Evaluation for Referring Expression Generation
Guanyi Chen | Fahime Same | Kees Van Deemter
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recently, a human evaluation study of Referring Expression Generation (REG) models had an unexpected conclusion: on WEBNLG, Referring Expressions (REs) generated by the state-of-the-art neural models were not only indistinguishable from the REs in WEBNLG but also from the REs generated by a simple rule-based system. Here, we argue that this limitation could stem from the use of a purely ratings-based human evaluation (which is a common practice in Natural Language Generation). To investigate these issues, we propose an intrinsic task-based evaluation for REG models, in which, in addition to rating the quality of REs, participants were asked to accomplish two meta-level tasks. One of these tasks concerns the referential success of each RE; the other task asks participants to suggest a better alternative for each RE. The outcomes suggest that, in comparison to previous evaluations, the new evaluation protocol assesses the performance of each REG model more comprehensively and makes the participants’ ratings more reliable and discriminable.
Generating Hotel Highlights from Unstructured Text using LLMs
Srinivas Ramesh Kamath | Fahime Same | Saad Mahamood
Proceedings of the 17th International Natural Language Generation Conference
We describe our implementation and evaluation of the Hotel Highlights system which has been deployed live by trivago. This system leverages a large language model (LLM) to generate a set of highlights from accommodation descriptions and reviews, enabling travellers to quickly understand each accommodation's unique aspects. In this paper, we discuss our motivation for building this system and the human evaluation we conducted, comparing the generated highlights against the source input to assess the degree of hallucinations and/or contradictions present. Finally, we outline the lessons learned and the improvements needed.
Reference and discourse structure annotation of elicited chat continuations in German
Katja Jasinskaja | Yuting Li | Fahime Same | David Uerlings
Proceedings of The 18th Linguistic Annotation Workshop (LAW-XVIII)
We present the construction of a German chat corpus in an experimental setting. Our primary objective is to advance the methodology of discourse continuation for dialogue. The corpus features a fine-grained, multi-layer annotation of referential expressions and coreferential chains. Additionally, we have developed a comprehensive annotation scheme for coherence relations to describe discourse structure.
Experimental versus In-Corpus Variation in Referring Expression Choice
T. Mark Ellison | Fahime Same
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
In this paper, we compare the results of three studies. The first explores feature-conditioned distributions of referring expression (RE) forms in the original corpus from which the contexts were taken. The second is a crowdsourcing study in which we asked participants to express entities within a pre-existing context, given fully specified referents. The third study replicates the crowdsourcing experiment using Large Language Models (LLMs). We evaluate how well the corpus itself can model the variation found when multiple informants (either human participants or LLMs) choose REs in the same contexts. We measure the similarity of the conditional distributions of form categories using the Jensen-Shannon Divergence and Description Length metrics. We find that the experimental methodology introduces substantial noise, but by taking this noise into account, we can model the variation captured from the corpus and RE form choices made during experiments. Furthermore, we compare the three conditional distributions over the corpus, the human experimental results, and the GPT models. Against our expectations, the divergence is greatest between the corpus and the GPT model.
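As a rough illustration of the Jensen-Shannon Divergence mentioned in this abstract, the sketch below compares two distributions over RE form categories for a single context. It is not the authors' implementation: the three-way category set `FORMS`, the helper names, and the toy counts are all assumptions made for illustration, and the paper's conditioning on features is not modelled here.

```python
import math
from collections import Counter

# Hypothetical set of RE form categories (an assumption, not taken from the paper).
FORMS = ("pronoun", "proper name", "description")


def form_distribution(choices, forms=FORMS):
    """Relative frequency of each RE form category in a list of choices."""
    counts = Counter(choices)
    total = sum(counts[f] for f in forms)
    if total == 0:
        raise ValueError("no choices fall into the given form categories")
    return [counts[f] / total for f in forms]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: symmetric and bounded in [0, 1]."""
    # The mixture m is non-zero wherever p or q is non-zero, so the KL terms are defined.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


if __name__ == "__main__":
    # Invented toy data: RE forms observed for one context in two sources.
    corpus = ["pronoun"] * 6 + ["proper name"] * 3 + ["description"] * 1
    participants = ["pronoun"] * 4 + ["proper name"] * 4 + ["description"] * 2
    print(js_divergence(form_distribution(corpus), form_distribution(participants)))
```

With base-2 logarithms the value is 0 for identical distributions and 1 for distributions with no overlap, which makes divergences between different pairs of sources (for example corpus versus participants, or corpus versus model output) directly comparable.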
2023
Models of reference production: How do they withstand the test of time?
Fahime Same | Guanyi Chen | Kees van Deemter
Proceedings of the 16th International Natural Language Generation Conference
In recent years, many NLP studies have focused solely on performance improvement. In this work, we focus on the linguistic and scientific aspects of NLP. We use the task of generating referring expressions in context (REG-in-context) as a case study and start our analysis from GREC, a comprehensive set of shared tasks in English that addressed this topic over a decade ago. We ask what the performance of models would be if we assessed them (1) on more realistic datasets, and (2) using more advanced methods. We test the models using different evaluation metrics and feature selection experiments. We conclude that GREC can no longer be regarded as offering a reliable assessment of models’ ability to mimic human reference production, because the results are highly impacted by the choice of corpus and evaluation metrics. Our results also suggest that pre-trained language models are less dependent on the choice of corpus than classic Machine Learning models, and therefore make more robust class predictions.
Multi-layered Annotation of Conversation-like Narratives in German
Magdalena Repp | Petra B. Schumacher | Fahime Same
Proceedings of the 17th Linguistic Annotation Workshop (LAW-XVII)
This work presents two corpora based on excerpts from two novels with an informal narration style in German. We performed fine-grained multi-layer annotations of animate referents, assigning local and global prominence-lending features to the annotated referring expressions. In addition, our corpora include annotations of intra-sentential segments, which can serve as a more reliable unit of length measurement. Furthermore, we present two exemplary studies demonstrating how to use these corpora.
2022
Non-neural Models Matter: a Re-evaluation of Neural Referring Expression Generation Systems
Fahime Same | Guanyi Chen | Kees Van Deemter
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
In recent years, neural models have often outperformed rule-based and classic Machine Learning approaches in NLG. These classic approaches are now often disregarded, for example when new neural models are evaluated. We argue that they should not be overlooked, since, for some tasks, well-designed non-neural approaches achieve better performance than neural ones. In this paper, the task of generating referring expressions in linguistic context is used as an example. We examined two very different English datasets (WEBNLG and WSJ), and evaluated each algorithm using both automatic and human evaluations. Overall, the results of these evaluations suggest that rule-based systems with simple rule sets achieve on-par or better performance on both datasets compared to state-of-the-art neural REG systems. In the case of the more realistic dataset, WSJ, a machine learning-based system with well-designed linguistic features performed best. We hope that our work can encourage researchers to consider non-neural models in future.
Assessing Neural Referential Form Selectors on a Realistic Multilingual Dataset
Guanyi Chen | Fahime Same | Kees Van Deemter
Proceedings of the 3rd Workshop on Evaluation and Comparison of NLP Systems
Constructing Distributions of Variation in Referring Expression Type from Corpora for Model Evaluation
T. Mark Ellison | Fahime Same
Proceedings of the Thirteenth Language Resources and Evaluation Conference
The generation of referring expressions (REs) is a non-deterministic task. However, the algorithms for the generation of REs are standardly evaluated against corpora of written texts which include only one RE per reference. Our goal in this work is firstly to reproduce one of the few studies taking the distributional nature of RE generation into account. We add to this work by introducing a method for exploring variation in human RE choice on the basis of longitudinal corpora, that is, substantial corpora with a single human judgement (in the process of composition) per RE. We focus on the prediction of RE types: proper name, description and pronoun. We compare evaluations made against distributions over these types with evaluations made against parallel human judgements. Our results show agreement in the evaluation of learning algorithms against distributions constructed from parallel human evaluations and from longitudinal data.
2021
What can Neural Referential Form Selectors Learn?
Guanyi Chen | Fahime Same | Kees van Deemter
Proceedings of the 14th International Conference on Natural Language Generation
Despite achieving encouraging results, neural Referring Expression Generation models are often thought to lack transparency. We probed neural Referential Form Selection (RFS) models to find out to what extent the linguistic features influencing the RE form are learned and captured by state-of-the-art RFS models. The results of 8 probing tasks show that all the defined features were learned to some extent. The probing tasks pertaining to referential status and syntactic position exhibited the highest performance. The lowest performance was achieved by the probing models designed to predict discourse structure properties beyond the sentence level.
2020
Computational Interpretations of Recency for the Choice of Referring Expressions in Discourse
Fahime Same | Kees van Deemter
Proceedings of the First Workshop on Computational Approaches to Discourse
First, we discuss the most common linguistic perspectives on the concept of recency and propose a taxonomy of recency metrics employed in Machine Learning studies for choosing the form of referring expressions in discourse context. We then report on a Multi-Layer Perceptron study and a Sequential Forward Search experiment, followed by Bayes Factor analysis of the outcomes. The results suggest that recency metrics counting paragraphs and sentences contribute to referential choice prediction more than other recency-related metrics. Based on the results of our analysis, we argue that sensitivity to discourse structure is important for recency metrics used in determining referring expression forms.
A Linguistic Perspective on Reference: Choosing a Feature Set for Generating Referring Expressions in Context
Fahime Same | Kees van Deemter
Proceedings of the 28th International Conference on Computational Linguistics
This paper reports on a structured evaluation of feature-based Machine Learning algorithms for selecting the form of a referring expression in discourse context. Based on this evaluation, we selected seven feature sets from the literature, amounting to 65 distinct linguistic features. The features were then grouped into 9 broad classes. After building Random Forest models, we used Feature Importance Ranking and Sequential Forward Search methods to assess the “importance” of the features. Combining the results of the two methods, we propose a consensus feature set. The 6 features in our consensus set come from 4 different classes, namely grammatical role, inherent features of the referent, antecedent form and recency.