2022
Opening up Minds with Argumentative Dialogues
Youmna Farag
|
Charlotte Brand
|
Jacopo Amidei
|
Paul Piwek
|
Tom Stafford
|
Svetlana Stoyanchev
|
Andreas Vlachos
Findings of the Association for Computational Linguistics: EMNLP 2022
Recent research on argumentative dialogues has focused on persuading people to take some action, changing their stance on the topic of discussion, or winning debates. In this work, we focus on argumentative dialogues that aim to open up (rather than change) people’s minds to help them become more understanding of views that are unfamiliar or in opposition to their own convictions. To this end, we present a dataset of 183 argumentative dialogues about 3 controversial topics: veganism, Brexit and COVID-19 vaccination. The dialogues were collected using the Wizard of Oz approach, where wizards leverage a knowledge base of arguments to converse with participants. Open-mindedness is measured before and after engaging in the dialogue using a questionnaire from the psychology literature, and success of the dialogue is measured as the change in the participant’s stance towards those who hold opinions different to theirs. We evaluate two dialogue models: a Wikipedia-based and an argument-based model. We show that while both models perform similarly in terms of opening up minds, the argument-based model is significantly better on other dialogue properties such as engagement and clarity.
2020
Identifying Annotator Bias: A new IRT-based method for bias identification
Jacopo Amidei
|
Paul Piwek
|
Alistair Willis
Proceedings of the 28th International Conference on Computational Linguistics
A basic step in any annotation effort is the measurement of Inter-Annotator Agreement (IAA). An important factor that can affect the IAA is the presence of annotator bias. In this paper we introduce a new interpretation and application of Item Response Theory (IRT) to detect annotator bias. Our interpretation of IRT offers an original bias identification method that can be used to compare annotators’ bias and characterise annotation disagreement. Our method can be used to spot outlier annotators, improve annotation guidelines and provide a better picture of annotation reliability. Additionally, because scales for IAA interpretation are not generally agreed upon, our bias identification method is valuable as a complement to the IAA value, which can help with understanding the annotation disagreement.
2019
Agreement is overrated: A plea for correlation to assess human evaluation reliability
Jacopo Amidei
|
Paul Piwek
|
Alistair Willis
Proceedings of the 12th International Conference on Natural Language Generation
Inter-Annotator Agreement (IAA) is used as a means of assessing the quality of NLG evaluation data, in particular, its reliability. According to existing scales of IAA interpretation – see, for example, Lommel et al. (2014), Liu et al. (2016), Sedoc et al. (2018) and Amidei et al. (2018a) – most data collected for NLG evaluation fail the reliability test. We confirmed this trend by analysing papers published over the last 10 years in NLG-specific conferences (in total 135 papers that included some sort of human evaluation study). Following Sampson and Babarczy (2008), Lommel et al. (2014), Joshi et al. (2016) and Amidei et al. (2018b), such phenomena can be explained in terms of irreducible human language variability. Using three case studies, we show the limits of considering IAA as the only criterion for checking evaluation reliability. Given human language variability, we propose that for human evaluation of NLG, correlation coefficients and agreement coefficients should be used together to obtain a better assessment of the evaluation data reliability. This is illustrated using the three case studies.
The use of rating and Likert scales in Natural Language Generation human evaluation tasks: A review and some recommendations
Jacopo Amidei
|
Paul Piwek
|
Alistair Willis
Proceedings of the 12th International Conference on Natural Language Generation
Rating and Likert scales are widely used in evaluation experiments to measure the quality of Natural Language Generation (NLG) systems. We review the use of rating and Likert scales for NLG evaluation tasks published in NLG specialized conferences over the last ten years (135 papers in total). Our analysis brings to light a number of deviations from good practice in their use. We conclude with some recommendations about the use of such scales. Our aim is to encourage the appropriate use of evaluation methodologies in the NLG community.
2018
Evaluation methodologies in Automatic Question Generation 2013-2018
Jacopo Amidei
|
Paul Piwek
|
Alistair Willis
Proceedings of the 11th International Conference on Natural Language Generation
In the last few years Automatic Question Generation (AQG) has attracted increasing interest. In this paper we survey the evaluation methodologies used in AQG. Based on a sample of 37 papers, our research shows that the systems’ development has not been accompanied by similar developments in the methodologies used for the systems’ evaluation. Indeed, in the papers we examine here, we find a wide variety of both intrinsic and extrinsic evaluation methodologies. Such diverse evaluation practices make it difficult to reliably compare the quality of different generation systems. Our study suggests that, given the rapidly increasing level of research in the area, a common framework is urgently needed to compare the performance of AQG systems and NLG systems more generally.
Rethinking the Agreement in Human Evaluation Tasks
Jacopo Amidei
|
Paul Piwek
|
Alistair Willis
Proceedings of the 27th International Conference on Computational Linguistics
Human evaluations are broadly thought to be more valuable the higher the inter-annotator agreement. In this paper we examine this idea. We describe our experiments and analysis within the area of Automatic Question Generation. Our experiments show how annotators diverge in language annotation tasks due to a range of ineliminable factors. For this reason, we believe that annotation schemes for natural language generation tasks that are aimed at evaluating language quality need to be treated with great care. In particular, an unchecked focus on reduction of disagreement among annotators runs the danger of creating generation goals that reward output that is more distant from, rather than closer to, natural human-like language. We conclude the paper by suggesting a new approach to the use of agreement metrics in natural language generation evaluation tasks.
2017
A model of suspense for narrative generation
Richard Doust
|
Paul Piwek
Proceedings of the 10th International Conference on Natural Language Generation
Most work on automatic generation of narratives, and more specifically suspenseful narrative, has focused on detailed domain-specific modelling of character psychology and plot structure. Recent work in computational linguistics on the automatic learning of narrative schemas suggests an alternative approach that exploits such schemas as a starting point for modelling and measuring suspense. We propose a domain-independent model for tracking suspense in a story which can be used to predict the audience’s suspense response on a sentence-by-sentence basis at the content determination stage of narrative generation. The model lends itself as the theoretical foundation for a suspense module that is compatible with alternative narrative generation theories. In the evaluation, human judges’ normalised average scores correlate strongly with the predicted values.
2016
Measuring Non-cooperation in Dialogue
Brian Plüss
|
Paul Piwek
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers
This paper introduces a novel method for measuring non-cooperation in dialogue. The key idea is that linguistic non-cooperation can be measured in terms of the extent to which dialogue participants deviate from conventions regarding the proper introduction and discharging of conversational obligations (e.g., the obligation to respond to a question). Previous work on non-cooperation has focused mainly on non-linguistic task-related non-cooperation or modelled non-cooperation in terms of special rules describing non-cooperative behaviours. In contrast, we start from rules for normal/correct dialogue behaviour (i.e., a dialogue game), which in principle can be derived from a corpus of cooperative dialogues, and provide a quantitative measure for the degree to which participants comply with these rules. We evaluated the model on a corpus of political interviews, with encouraging results. The model accurately predicts the degree of cooperation for one of the two dialogue game roles (interviewer) and also the relative cooperation for both roles (i.e., which interlocutor in the conversation was more cooperative). Being able to measure cooperation has applications in many areas, from the analysis (manual, semi-automatic and fully automatic) of natural language interactions to human-like virtual personal assistants, tutoring agents, sophisticated dialogue systems, and role-playing virtual humans.
Collecting Reliable Human Judgements on Machine-Generated Language: The Case of the QG-STEC Data
Keith Godwin
|
Paul Piwek
Proceedings of the 9th International Natural Language Generation conference
2013
Introducing a Corpus of Human-Authored Dialogue Summaries in Portuguese
Norton Trevisan Roman
|
Paul Piwek
|
Ariadne M. B. Rizzoni Carvalho
|
Alexandre Rossi Alvares
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013
2012
Planning Accessible Explanations for Entailments in OWL Ontologies
Tu Anh T. Nguyen
|
Richard Power
|
Paul Piwek
|
Sandra Williams
INLG 2012 Proceedings of the Seventh International Natural Language Generation Conference
2011
The CODA System for Monologue-to-Dialogue Generation
Svetlana Stoyanchev
|
Paul Piwek
Proceedings of the SIGDIAL 2011 Conference
Question Generation Shared Task and Evaluation Challenge – Status Report
Vasile Rus
|
Brendan Wyse
|
Paul Piwek
|
Mihai Lintean
|
Svetlana Stoyanchev
|
Cristian Moldovan
Proceedings of the 13th European Workshop on Natural Language Generation
Data-oriented Monologue-to-Dialogue Generation
Paul Piwek
|
Svetlana Stoyanchev
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies
2010
Generating Expository Dialogue from Monologue: Motivation, Corpus and Preliminary Rules
Paul Piwek
|
Svetlana Stoyanchev
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Harvesting Re-usable High-level Rules for Expository Dialogue Generation
Svetlana Stoyanchev
|
Paul Piwek
Proceedings of the 6th International Natural Language Generation Conference
The First Question Generation Shared Task Evaluation Challenge
Vasile Rus
|
Brendan Wyse
|
Paul Piwek
|
Mihai Lintean
|
Svetlana Stoyanchev
|
Cristian Moldovan
Proceedings of the 6th International Natural Language Generation Conference
Constructing the CODA Corpus: A Parallel Corpus of Monologues and Expository Dialogues
Svetlana Stoyanchev
|
Paul Piwek
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
We describe the construction of the CODA corpus, a parallel corpus of monologues and expository dialogues. The dialogue part of the corpus consists of expository, i.e., information-delivering rather than dramatic, dialogues written by several acclaimed authors. The monologue part of the corpus is a paraphrase in monologue form of these dialogues by a human annotator. The annotator-written monologue preserves all information present in the original dialogue and does not introduce any new information that is not present in the original dialogue. The corpus was constructed as a resource for extracting rules for automated generation of dialogue from monologue. Using authored dialogues allows us to analyse the techniques used by accomplished writers for presenting information in the form of dialogue. The dialogues are annotated with dialogue acts and the monologues with rhetorical structure. We developed annotation and translation guidelines together with a custom-developed tool for carrying out translation, alignment and annotation of the dialogues. The final parallel CODA corpus consists of 1000 dialogue turns that are tagged with dialogue acts and aligned with monologue that expresses the same information and has been annotated with rhetorical structure relations.
2008
Book Reviews: Incremental Conceptualization for Language Production by Markus Guhe
Paul Piwek
Computational Linguistics, Volume 34, Number 1, March 2008
2007
Generating monologue and dialogue to present personalised medical information to patients
Sandra Williams
|
Paul Piwek
|
Richard Power
Proceedings of the Eleventh European Workshop on Natural Language Generation (ENLG 07)
2006
The Alligator theorem prover for dependent type systems: Description and proof samples
Paul Piwek
Proceedings of the Fifth International Workshop on Inference in Computational Semantics (ICoS-5)
2003
A Flexible Pragmatics-Driven Language Generator for Animated Agents
Paul Piwek
10th Conference of the European Chapter of the Association for Computational Linguistics
2002
What is NLG?
Roger Evans
|
Paul Piwek
|
Lynne Cahill
Proceedings of the International Natural Language Generation Conference
2000
A Formal Semantics for Generating and Editing Plurals
Paul Piwek
COLING 2000 Volume 2: The 18th International Conference on Computational Linguistics