Multi3Generation: Multitask, Multilingual, Multimodal Language Generation
Anabela Barreiro | José GC de Souza | Albert Gatt | Mehul Bhatt | Elena Lloret | Aykut Erdem | Dimitra Gkatzia | Helena Moniz | Irene Russo | Fabio Kepler | Iacer Calixto | Marcin Paprzycki | François Portet | Isabelle Augenstein | Mirela Alhasani
This paper presents the Multitask, Multilingual, Multimodal Language Generation COST Action – Multi3Generation (CA18231), an interdisciplinary network of research groups working on different aspects of language generation. This “meta-paper” will serve as a reference for citations of the Action in future publications. It presents the objectives and challenges of the Action, and the links to the achieved outcomes.

Most NLG is Low-Resource: here’s what we can do about it
David M. Howcroft | Dimitra Gkatzia
Many domains and tasks in natural language generation (NLG) are inherently ‘low-resource’, where training data, tools and linguistic analyses are scarce. This poses a particular challenge to researchers and system developers in the era of machine-learning-driven NLG. In this position paper, we first present the challenges researchers and developers often encounter when dealing with low-resource settings in NLG. We then argue that it is unsustainable to collect large aligned datasets or build large language models from scratch for every possible domain due to cost, labour, and time constraints, so researching and developing methods and resources for low-resource settings is vital. Finally, we discuss current approaches to low-resource NLG, followed by proposed solutions and promising avenues for future work in NLG for low-resource settings.

Task2Dial: A Novel Task and Dataset for Commonsense-enhanced Task-based Dialogue Grounded in Documents
Carl Strathearn | Dimitra Gkatzia
This paper proposes a novel task on commonsense-enhanced task-based dialogue grounded in documents and describes the Task2Dial dataset, a novel dataset of document-grounded task-based dialogues, where an Information Giver (IG) provides instructions (by consulting a document) to an Information Follower (IF), so that the latter can successfully complete the task. In this unique setting, the IF can ask clarification questions which may not be grounded in the underlying document and require commonsense knowledge to be answered. The Task2Dial dataset poses new challenges: (1) its human reference texts show more lexical richness and variation than other document-grounded dialogue datasets; (2) generating from this set requires paraphrasing, as instructional responses might have been modified from the underlying document; (3) answering requires commonsense knowledge, since questions might not necessarily be grounded in the document; (4) generating requires planning based on context, as task steps need to be provided in order. The Task2Dial dataset contains dialogues with an average of 18.15 turns and 19.79 tokens per turn, as compared to 12.94 and 12 respectively in existing datasets. As such, learning from this dataset promises more natural, varied and less template-like system utterances.
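The two corpus statistics quoted in this abstract (average turns per dialogue and average tokens per turn) could be computed along the following lines. This is only a minimal sketch: the representation of a dialogue as a list of utterance strings, the whitespace tokenisation, and the function name `dialogue_stats` are assumptions made for illustration, not the format or tooling of the Task2Dial release.

```python
from statistics import mean


def dialogue_stats(dialogues):
    """Return (average turns per dialogue, average tokens per turn).

    `dialogues` is assumed to be a list of dialogues, each a list of
    utterance strings. Tokens are split on whitespace (an assumption).
    """
    turns_per_dialogue = [len(dialogue) for dialogue in dialogues]
    tokens_per_turn = [
        len(turn.split()) for dialogue in dialogues for turn in dialogue
    ]
    return mean(turns_per_dialogue), mean(tokens_per_turn)


# Toy data, invented for illustration only.
toy = [
    ["Boil the water first.", "How much water?", "About one litre."],
    ["Chop the onion.", "Done."],
]
avg_turns, avg_tokens = dialogue_stats(toy)
print(avg_turns, avg_tokens)
```

A different tokeniser would change the tokens-per-turn figure, so comparisons across datasets only make sense when the same tokenisation is applied to each.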


Dimitra Gkatzia | Djamé Seddah
It’s Commonsense, isn’t it? Demystifying Human Evaluations in Commonsense-Enhanced NLG Systems
Miruna-Adriana Clinciu | Dimitra Gkatzia | Saad Mahamood
Common sense is an integral part of human cognition which allows us to make sound decisions, communicate effectively with others and interpret situations and utterances. Endowing AI systems with commonsense knowledge capabilities will help us get closer to creating systems that exhibit human intelligence. Recent efforts in Natural Language Generation (NLG) have focused on incorporating commonsense knowledge through large-scale pre-trained language models or by incorporating external knowledge bases. Such systems exhibit reasoning capabilities without common sense being explicitly encoded in the training set. These systems require careful evaluation, as they incorporate additional resources during training, which introduces additional sources of error. Additionally, human evaluation of such systems can vary significantly, making it impossible to compare different systems and define baselines. This paper aims to demystify human evaluations of commonsense-enhanced NLG systems by proposing the Commonsense Evaluation Card (CEC), a set of recommendations for evaluation reporting of commonsense-enhanced NLG systems, underpinned by an extensive analysis of human evaluations reported in the recent literature.

Task2Dial Dataset: A Novel Dataset for Commonsense-enhanced Task-based Dialogue Grounded in Documents
Carl Strathearn | Dimitra Gkatzia
Chefbot: A Novel Framework for the Generation of Commonsense-enhanced Responses for Task-based Dialogue Systems
Carl Strathearn | Dimitra Gkatzia
Conversational systems aim to generate responses that are accurate, relevant and engaging, either through utilising neural end-to-end models or through slot filling. In human-to-human conversations, speakers draw not only on the latest utterance of the interlocutor, but also on relevant information about concepts and objects covered in the dialogue, which they integrate into their responses. Such information may include recently mentioned concepts, commonsense knowledge and more. A concrete scenario of such dialogues is the cooking scenario, i.e. when an artificial agent (personal assistant, robot, chatbot) and a human converse about a recipe. We will demo a novel system for commonsense-enhanced response generation in the scenario of cooking, where the conversational system is able to not only provide directions for cooking step-by-step, but also display commonsense capabilities by offering explanations of how objects can be used and providing recommendations for replacing ingredients.

Underreporting of errors in NLG output, and what to do about it
Emiel van Miltenburg | Miruna Clinciu | Ondřej Dušek | Dimitra Gkatzia | Stephanie Inglis | Leo Leppänen | Saad Mahamood | Emma Manning | Stephanie Schoch | Craig Thomson | Luou Wen
We observe a severe under-reporting of the different kinds of errors that Natural Language Generation systems make. This is a problem, because mistakes are an important indicator of where systems should still be improved. If authors only report overall performance metrics, the research community is left in the dark about the specific weaknesses that are exhibited by ‘state-of-the-art’ research. Next to quantifying the extent of error under-reporting, this position paper provides recommendations for error identification, analysis and reporting.

CAPE: Context-Aware Private Embeddings for Private Language Learning
Richard Plant | Dimitra Gkatzia | Valerio Giuffrida
Neural language models have contributed to state-of-the-art results in a number of downstream applications including sentiment analysis, intent classification and others. However, obtaining text representations or embeddings using these models risks encoding personally identifiable information learned from language and context cues that may lead to privacy leaks. To ameliorate this issue, we propose Context-Aware Private Embeddings (CAPE), a novel approach which combines differential privacy and adversarial learning to preserve privacy during training of embeddings. Specifically, CAPE firstly applies calibrated noise through differential privacy to maintain the privacy of text representations by preserving the encoded semantic links while obscuring sensitive information. Next, CAPE employs an adversarial training regime that obscures identified private variables. Experimental results demonstrate that our proposed approach is more effective in reducing private information leakage than either single intervention, with approximately a 3% reduction in attacker performance compared to the best-performing current method.
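The first CAPE step described above, perturbing a text representation with calibrated noise, can be illustrated with a short sketch. The abstract does not say which noise mechanism or calibration is used, so the choices here are assumptions: embeddings are min-max scaled to [0, 1], a per-coordinate sensitivity of 1 is assumed, and Laplace noise with scale 1/epsilon is added. The function name `privatise` is hypothetical, the adversarial training step is not shown, and the code makes no formal privacy claim.

```python
import random


def privatise(embedding, epsilon, rng=None):
    """Scale an embedding to [0, 1] and add Laplace noise per coordinate.

    Assumptions (not from the abstract): min-max scaling, sensitivity 1
    per coordinate, and Laplace noise with scale 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    lo, hi = min(embedding), max(embedding)
    span = (hi - lo) or 1.0  # avoid division by zero for constant vectors
    scaled = [(x - lo) / span for x in embedding]
    # The difference of two i.i.d. exponential draws with rate `epsilon`
    # follows a Laplace distribution with location 0 and scale 1 / epsilon.
    return [
        x + rng.expovariate(epsilon) - rng.expovariate(epsilon)
        for x in scaled
    ]


noisy = privatise([0.2, -1.3, 0.7, 0.0], epsilon=1.0, rng=random.Random(0))
print(noisy)
```

Smaller values of epsilon add more noise, which obscures more of the original representation at the cost of downstream utility.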


Improving the Naturalness and Diversity of Referring Expression Generation models using Minimum Risk Training
Nikolaos Panagiaris | Emma Hart | Dimitra Gkatzia
In this paper we consider the problem of optimizing neural Referring Expression Generation (REG) models with sequence-level objectives. Recently, reinforcement learning (RL) techniques have been adopted to train deep end-to-end systems to directly optimize sequence-level objectives. However, there are two issues associated with RL training: (1) effectively applying RL is challenging, and (2) the generated sentences lack diversity and naturalness due to deficiencies in the generated word distribution, smaller vocabulary size, and repetitiveness of frequent words and phrases. To alleviate these issues, we propose a novel strategy for training REG models, using minimum risk training (MRT) with maximum likelihood estimation (MLE), and we show that our approach outperforms RL with respect to naturalness and diversity of the output. Specifically, our approach achieves an increase in CIDEr scores of between 23% and 57% on two datasets. We further demonstrate the robustness of the proposed method through a detailed comparison with different REG models.
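As a rough illustration of the kind of objective this abstract refers to, the sketch below computes a minimum-risk term over a small set of candidate outputs and interpolates it with an MLE term. It follows the commonly used formulation in which candidate probabilities are sharpened by a factor `alpha` and renormalised over the candidate set. How the paper actually combines MRT and MLE is not stated in the abstract, so the linear interpolation, the weight `lam`, and the function names are assumptions.

```python
import math


def mrt_risk(log_probs, costs, alpha=1.0):
    """Expected cost under the renormalised candidate distribution.

    log_probs: model log-probabilities of the sampled candidates.
    costs: task cost per candidate, e.g. 1 - metric score (lower is better).
    alpha: sharpness of the renormalised distribution.
    """
    scaled = [alpha * lp for lp in log_probs]
    top = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    total = sum(weights)
    return sum(w / total * c for w, c in zip(weights, costs))


def combined_loss(log_probs, costs, ref_log_prob, lam=0.5, alpha=1.0):
    """Interpolate the MRT risk with the MLE loss (negative log-likelihood
    of the reference). The weight `lam` is a hypothetical parameter."""
    return lam * mrt_risk(log_probs, costs, alpha) + (1 - lam) * (-ref_log_prob)


# Two candidates with model probabilities 0.75 and 0.25 and costs 0 and 1.
risk = mrt_risk([math.log(0.75), math.log(0.25)], [0.0, 1.0])
print(risk)
```

In a real training loop the log-probabilities would be differentiable model outputs, so that minimising the risk shifts probability mass towards low-cost candidates.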

Twenty Years of Confusion in Human Evaluation: NLG Needs Evaluation Sheets and Standardised Definitions
David M. Howcroft | Anya Belz | Miruna-Adriana Clinciu | Dimitra Gkatzia | Sadid A. Hasan | Saad Mahamood | Simon Mille | Emiel van Miltenburg | Sashank Santhanam | Verena Rieser
Human assessment remains the most trusted form of evaluation in NLG, but highly diverse approaches and a proliferation of different quality criteria used by researchers make it difficult to compare results and draw conclusions across papers, with adverse implications for meta-evaluation and reproducibility. In this paper, we present (i) our dataset of 165 NLG papers with human evaluations, (ii) the annotation scheme we developed to label the papers for different aspects of evaluations, (iii) quantitative analyses of the annotations, and (iv) a set of recommendations for improving standards in evaluation reporting. We use the annotations as a basis for examining information included in evaluation reports, and levels of consistency in approaches, experimental design and terminology, focusing in particular on the 200+ different terms that have been used for evaluated aspects of quality. We conclude that due to a pervasive lack of clarity in reports and extreme diversity in approaches, human evaluation in NLG presents as extremely confused in 2020, and that the field is in urgent need of standard methods and terminology.

Shubham Agarwal | Ondřej Dušek | Sebastian Gehrmann | Dimitra Gkatzia | Ioannis Konstas | Emiel Van Miltenburg | Sashank Santhanam
Mary Ellen Foster | Hendrik Buschmeier | Dimitra Gkatzia
Learning from limited datasets: Implications for Natural Language Generation and Human-Robot Interaction
Jekaterina Belakova | Dimitra Gkatzia
One of the most natural ways for humans and robots to communicate is through spoken language. Training human-robot interaction systems requires access to large datasets, which are expensive and labour-intensive to obtain. In this paper, we describe an approach for learning from minimal data, using language understanding in spoken dialogue systems as a toy example. Understanding of spoken language is crucial because it has implications for natural language generation, i.e. correctly understanding a user’s utterance will lead to choosing the right response/action. Finally, we discuss implications for Natural Language Generation in Human-Robot Interaction.


Improving the Naturalness and Expressivity of Language Generation for Spanish
Cristina Barros | Dimitra Gkatzia | Elena Lloret
We present a flexible Natural Language Generation approach for Spanish, focused on the surface realisation stage, which integrates an inflection module in order to improve the naturalness and expressivity of the generated language. This inflection module inflects the verbs using an ensemble of trainable algorithms, whereas the other types of words (e.g. nouns, determiners, etc.) are inflected using hand-crafted rules. We show that our approach achieves 2% higher accuracy than two state-of-the-art inflection generation approaches. Furthermore, our proposed approach also predicts an extra feature: the inflection of the imperative mood, which was not taken into account by previous work. We also present a user evaluation, where we demonstrate that the proposed method significantly improves the perceived naturalness of the generated language.

Inflection Generation for Spanish Verbs using Supervised Learning
Cristina Barros | Dimitra Gkatzia | Elena Lloret
We present a novel supervised approach to inflection generation for verbs in Spanish. Our system takes as input the verb’s lemma form and the desired features such as person, number and tense, and is able to predict the appropriate grammatical conjugation. Even though our approach learns from fewer examples compared to previous work, it is able to deal with all the Spanish moods (indicative, subjunctive and imperative), in contrast to previous work which only focuses on indicative and subjunctive moods. We show that in an intrinsic evaluation, our system achieves 99% accuracy, outperforming (although not significantly) two competitive state-of-the-art systems. The successful results obtained clearly indicate that our approach could be integrated into wider approaches related to text generation in Spanish.


Amy Isard | Verena Rieser | Dimitra Gkatzia
Natural Language Generation enhances human decision-making with uncertain information
Dimitra Gkatzia | Oliver Lemon | Verena Rieser
The REAL Corpus: A Crowd-Sourced Corpus of Human Generated and Evaluated Spatial References to Real-World Urban Scenes
Phil Bartie | William Mackaness | Dimitra Gkatzia | Verena Rieser
Our interest is in people’s capacity to efficiently and effectively describe geographic objects in urban scenes. The broader ambition is to develop spatial models with equivalent functionality that are able to construct such referring expressions. To that end we present a newly crowd-sourced data set of natural language references to objects anchored in complex urban scenes (in short, the REAL Corpus: Referring Expressions Anchored Language). The REAL corpus contains a collection of images of real-world urban scenes together with verbal descriptions of target objects generated by humans, paired with data on how successfully other people were able to identify the same object based on these descriptions. In total, the corpus contains 32 images with on average 27 descriptions per image and 3 verifications for each description. In addition, the corpus is annotated with a variety of linguistically motivated features. The paper highlights issues posed by collecting data using crowd-sourcing with an unrestricted input format, as well as using real-world urban scenes.
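The figures in this abstract imply an approximate overall corpus size, which the short calculation below works out. The totals are estimates only, because the abstract gives the number of descriptions per image as an average; the variable names are chosen for illustration.

```python
# Figures taken from the abstract.
images = 32
avg_descriptions_per_image = 27  # an average, so totals are approximate
verifications_per_description = 3

approx_descriptions = images * avg_descriptions_per_image
approx_verifications = approx_descriptions * verifications_per_description

print(approx_descriptions)   # 864
print(approx_verifications)  # 2592
```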


From the Virtual to the RealWorld: Referring to Objects in Real-World Spatial Scenes
Dimitra Gkatzia | Verena Rieser | Phil Bartie | William Mackaness
A Snapshot of NLG Evaluation Practices 2005 - 2014
Dimitra Gkatzia | Saad Mahamood
Generating and Evaluating Landmark-Based Navigation Instructions in Virtual Environments
Amanda Cercas Curry | Dimitra Gkatzia | Verena Rieser
A Game-Based Setup for Data Collection and Task-Based Evaluation of Uncertain Information Presentation
Dimitra Gkatzia | Amanda Cercas Curry | Verena Rieser | Oliver Lemon
Multi-adaptive Natural Language Generation using Principal Component Regression
Dimitra Gkatzia | Helen Hastie | Oliver Lemon
Comparing Multi-label Classification with Reinforcement Learning for Summarisation of Time-series Data
Dimitra Gkatzia | Helen Hastie | Oliver Lemon
Finding middle ground? Multi-objective Natural Language Generation from time-series data
Dimitra Gkatzia | Helen Hastie | Oliver Lemon
Generating Student Feedback from Time-Series Data Using Reinforcement Learning
Dimitra Gkatzia | Helen Hastie | Srinivasan Janarthanam | Oliver Lemon
