International Natural Language Generation Conference (2020)


Proceedings of the 13th International Conference on Natural Language Generation
Brian Davis | Yvette Graham | John Kelleher | Yaji Sripada

Memory Attentive Fusion: External Language Model Integration for Transformer-based Sequence-to-Sequence Model
Mana Ihori | Ryo Masumura | Naoki Makishima | Tomohiro Tanaka | Akihiko Takashima | Shota Orihashi

This paper presents a novel fusion method for integrating an external language model (LM) into the Transformer-based sequence-to-sequence (seq2seq) model. While paired data are required to train the seq2seq model, the external LM can be trained with only unpaired data. Thus, it is important to leverage memorized knowledge in the external LM for building the seq2seq model, since it is hard to prepare a large amount of paired data. However, the existing fusion methods assume that the LM is integrated with recurrent neural network-based seq2seq models instead of the Transformer. Therefore, this paper proposes a fusion method that can explicitly utilize network structures in the Transformer. The proposed method, called memory attentive fusion, leverages the Transformer-style attention mechanism that repeats source-target attention in a multi-hop manner for reading the memorized knowledge in the LM. Our experiments on two text-style conversion tasks demonstrate that the proposed method performs better than conventional fusion methods.

Arabic NLG Language Functions
Wael Abed | Ehud Reiter

The Arabic language has received very limited support from NLG researchers. In this paper, we explain the challenges of the core grammar, provide a lexical resource, and implement the first language functions for the Arabic language. We conducted a human evaluation of our functions in generating sentences from the NADA Corpus.

Generating Intelligible Plumitifs Descriptions: Use Case Application with Ethical Considerations
David Beauchemin | Nicolas Garneau | Eve Gaumond | Pierre-Luc Déziel | Richard Khoury | Luc Lamontagne

Plumitifs (dockets) were initially a tool for law clerks. Nowadays, they are used as summaries presenting all the steps of a judicial case. Information concerning parties’ identity, jurisdiction in charge of administering the case, and some information relating to the nature and the course of the proceedings are available through plumitifs. They are publicly accessible but barely understandable; they are written using abbreviations and refer to provisions from the Criminal Code of Canada, which makes them hard to reason about. In this paper, we propose a simple yet efficient multi-source language generation architecture that leverages both the plumitif and the Criminal Code’s content to generate intelligible plumitifs descriptions. It goes without saying that ethical considerations arise with these sensitive documents made readable and available at scale, legitimate concerns that we address in this paper. This is, to the best of our knowledge, the first application of plumitifs descriptions generation made available for French speakers along with an ethical discussion about the topic.

RecipeNLG: A Cooking Recipes Dataset for Semi-Structured Text Generation
Michał Bień | Michał Gilski | Martyna Maciejewska | Wojciech Taisner | Dawid Wisniewski | Agnieszka Lawrynowicz

Semi-structured text generation is a non-trivial problem. Although recent years have brought many improvements in natural language generation, thanks to the development of neural models trained on large-scale datasets, these approaches still struggle with producing structured, context- and commonsense-aware texts. Moreover, it is not clear how to evaluate the quality of generated texts. To address these problems, we introduce RecipeNLG – a novel dataset of cooking recipes. We discuss the data collection process and the relation between the semi-structured texts and cooking recipes. We use the dataset to approach the problem of generating recipes. Finally, we make use of multiple metrics to evaluate the generated recipes.

Controlled Text Generation with Adversarial Learning
Federico Betti | Giorgia Ramponi | Massimo Piccardi

In recent years, generative adversarial networks (GANs) have started to attain promising results also in natural language generation. However, the existing models have paid limited attention to the semantic coherence of the generated sentences. For this reason, in this paper we propose a novel network – the Controlled TExt generation Relational Memory GAN (CTERM-GAN) – that uses an external input to influence the coherence of sentence generation. The network is composed of three main components: a generator based on a Relational Memory conditioned on the external input; a syntactic discriminator which learns to discriminate between real and generated sentences; and a semantic discriminator which assesses the coherence with the external conditioning. Our experiments on six probing datasets have shown that the model has been able to achieve interesting results, retaining or improving the syntactic quality of the generated sentences while significantly improving their semantic coherence with the given input.

Studying the Impact of Filling Information Gaps on the Output Quality of Neural Data-to-Text
Craig Thomson | Zhijie Zhao | Somayajulu Sripada

It is unfair to expect neural data-to-text to produce high quality output when there are gaps between system input data and information contained in the training text. Thomson et al. (2020) identify and narrow information gaps in Rotowire, a popular data-to-text dataset. In this paper, we describe a study which finds that a state-of-the-art neural data-to-text system produces higher quality output, according to the information extraction (IE) based metrics, when additional input data is carefully selected from this newly available source. It remains to be shown, however, whether IE metrics used in this study correlate well with humans in judging text quality.

Improving the Naturalness and Diversity of Referring Expression Generation models using Minimum Risk Training
Nikolaos Panagiaris | Emma Hart | Dimitra Gkatzia

In this paper we consider the problem of optimizing neural Referring Expression Generation (REG) models with sequence-level objectives. Recently, reinforcement learning (RL) techniques have been adopted to train deep end-to-end systems to directly optimize sequence-level objectives. However, there are two issues associated with RL training: (1) effectively applying RL is challenging, and (2) the generated sentences lack diversity and naturalness due to deficiencies in the generated word distribution, smaller vocabulary size, and repetitiveness of frequent words and phrases. To alleviate these issues, we propose a novel strategy for training REG models, using minimum risk training (MRT) with maximum likelihood estimation (MLE), and we show that our approach outperforms RL w.r.t. naturalness and diversity of the output. Specifically, our approach achieves an increase in CIDEr scores of between 23% and 57% on two datasets. We further demonstrate the robustness of the proposed method through a detailed comparison with different REG models.

Assessing Discourse Relations in Language Generation from GPT-2
Wei-Jen Ko | Junyi Jessy Li

Recent advances in NLP have been attributed to the emergence of large-scale pre-trained language models. GPT-2, in particular, is suited for generation tasks given its left-to-right language modeling objective, yet the linguistic quality of its generated text has largely remained unexplored. Our work takes a step in understanding GPT-2’s outputs in terms of discourse coherence. We perform a comprehensive study on the validity of explicit discourse relations in GPT-2’s outputs under both organic generation and fine-tuned scenarios. Results show GPT-2 does not always generate text containing valid discourse relations; nevertheless, its text is more aligned with human expectation in the fine-tuned scenario. We propose a decoupled strategy to mitigate these problems and highlight the importance of explicitly modeling discourse information.

Data-to-Text Generation with Iterative Text Editing
Zdeněk Kasner | Ondřej Dušek

We present a novel approach to data-to-text generation based on iterative text editing. Our approach maximizes the completeness and semantic accuracy of the output text while leveraging the abilities of recent pre-trained models for text editing (LaserTagger) and language modeling (GPT-2) to improve the text fluency. To this end, we first transform data items to text using trivial templates, and then we iteratively improve the resulting text by a neural model trained for the sentence fusion task. The output of the model is filtered by a simple heuristic and reranked with an off-the-shelf pre-trained language model. We evaluate our approach on two major data-to-text datasets (WebNLG, Cleaned E2E) and analyze its caveats and benefits. Furthermore, we show that our formulation of data-to-text generation opens up the possibility for zero-shot domain adaptation using a general-domain dataset for sentence fusion.
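
The pipeline described in this abstract (trivial templates, iterative fusion, heuristic filtering, language-model reranking) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: `toy_fuse` stands in for the neural sentence-fusion model (LaserTagger in the paper), `toy_lm_score` stands in for the GPT-2 scorer, and the entity-presence check is only one plausible reading of the "simple heuristic"; none of these names come from the authors' code.

```python
from typing import Callable, List, Sequence, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate phrase, object)


def template(triple: Triple) -> str:
    """Trivial template: verbalize one data item as a flat sentence."""
    subj, pred, obj = triple
    return f"{subj} {pred} {obj}."


def keeps_entities(text: str, triples: Sequence[Triple]) -> bool:
    """Heuristic filter (assumed): every subject and object must still appear."""
    return all(s in text and o in text for s, _, o in triples)


def generate(
    triples: Sequence[Triple],
    fuse: Callable[[str, str], List[str]],
    lm_score: Callable[[str], float],
) -> str:
    """Add one fact per iteration, keeping the best-scoring valid fusion."""
    text = template(triples[0])
    for i in range(1, len(triples)):
        sentence = template(triples[i])
        seen = triples[: i + 1]
        candidates = [c for c in fuse(text, sentence) if keeps_entities(c, seen)]
        # Fall back to plain concatenation when no candidate passes the filter.
        text = max(candidates, key=lm_score) if candidates else f"{text} {sentence}"
    return text


def toy_fuse(text: str, sentence: str) -> List[str]:
    """Hypothetical stand-in for a fusion model (assumes a shared one-word subject)."""
    _subject, predicate = sentence.split(" ", 1)
    merged = f"{text.rstrip('.')} and {predicate}"
    lossy = text  # a bad candidate that silently drops the new fact
    return [merged, lossy]


def toy_lm_score(text: str) -> float:
    """Hypothetical stand-in for a language-model score: shorter is better."""
    return -float(len(text))


facts = [
    ("Alimentum", "serves", "Chinese food"),
    ("Alimentum", "is near", "Yippee Noodle Bar"),
]
print(generate(facts, toy_fuse, toy_lm_score))
```

The lossy candidate is shorter and would win on the toy score alone, which is why the filter runs before reranking in this sketch.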

The CACAPO Dataset: A Multilingual, Multi-Domain Dataset for Neural Pipeline and End-to-End Data-to-Text Generation
Chris van der Lee | Chris Emmery | Sander Wubben | Emiel Krahmer

This paper describes the CACAPO dataset, built for training both neural pipeline and end-to-end data-to-text language generation systems. The dataset is multilingual (Dutch and English), and contains almost 10,000 sentences from human-written news texts in the sports, weather, stocks, and incidents domains, together with aligned attribute-value paired data. The dataset is unique in that the linguistic variation and indirect ways of expressing data in these texts reflect the challenges of real-world NLG tasks.

Towards Generating Query to Perform Query Focused Abstractive Summarization using Pre-trained Model
Deen Mohammad Abdullah | Yllias Chali

Query Focused Abstractive Summarization (QFAS) produces an abstractive summary of the source document based on a given query. To measure the performance of abstractive summarization tasks, different datasets have been broadly used. However, for QFAS tasks, only a limited number of datasets have been used, which are comparatively small and provide single-sentence summaries. This paper presents a query generation approach, where we considered the most similar words between documents and summaries for generating queries. By implementing our query generation approach, we prepared two relatively large datasets, namely CNN/DailyMail and Newsroom, which contain multiple-sentence summaries and can be used for future QFAS tasks. We also implemented a pre-processing approach to perform QFAS tasks using a pretrained language model, BERTSUM. In our pre-processing approach, we sorted the sentences of the documents from the most query-related to the least query-related. Then, we fine-tuned the BERTSUM model for generating the abstractive summaries. We also experimented on one of the widely used datasets, Debatepedia, to compare our QFAS approach with other models. The experimental results show that our approach outperforms the state-of-the-art models on three ROUGE scores.
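
The pre-processing step in this abstract, ordering document sentences by how related they are to the query, can be sketched as follows. The abstract does not say how relatedness is measured, so the word-overlap score used here is purely an assumption for illustration, and the function names are hypothetical.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def relevance(sentence: str, query: str) -> int:
    """Assumed relatedness score: number of distinct query words in the sentence."""
    return len(set(tokenize(query)) & set(tokenize(sentence)))


def sort_by_query(sentences: List[str], query: str) -> List[str]:
    """Order sentences from most to least query-related.

    sorted() is stable, so sentences with equal scores keep document order.
    """
    return sorted(sentences, key=lambda s: relevance(s, query), reverse=True)


document = [
    "The weather was mild.",
    "The council approved the new budget.",
    "Budget cuts affect the city council and schools.",
]
for sentence in sort_by_query(document, "city council budget"):
    print(sentence)
```

The reordered sentences would then be passed to the summarizer in place of the original document order.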

SimpleNLG-TI: Adapting SimpleNLG to Tibetan
Zewang Kuanzhuo | Li Lin | Zhao Weina

Surface realisation is the last but not the least phase of Natural Language Generation, which aims to produce high-quality natural language text based on meaning representations. In this article, we present our work on SimpleNLG-TI, a Tibetan surface realiser, which follows the design paradigm of SimpleNLG-EN. SimpleNLG-TI is built up by our investigation of the core features of Tibetan morphology and syntax. Through this work, we provide a robust and flexible surface realiser for Tibetan generation systems.

Machine Translation Pre-training for Data-to-Text Generation - A Case Study in Czech
Mihir Kale | Scott Roy

While there is a large body of research studying deep learning methods for text generation from structured data, almost all of it focuses purely on English. In this paper, we study the effectiveness of machine translation based pre-training for data-to-text generation in non-English languages. Since the structured data is generally expressed in English, text generation into other languages involves elements of translation, transliteration and copying - elements already encoded in neural machine translation systems. Moreover, since data-to-text corpora are typically small, this task can benefit greatly from pre-training. We conduct experiments on Czech, a morphologically complex language. Results show that machine translation pre-training lets us train end-to-end models that significantly improve upon unsupervised pre-training and linguistically informed pipelined neural systems, as judged by automatic metrics and human evaluation. We also show that this approach enjoys several desirable properties, including improved performance in low data scenarios and applicability to low resource languages.

Text-to-Text Pre-Training for Data-to-Text Tasks
Mihir Kale | Abhinav Rastogi

We study the pre-train + fine-tune strategy for data-to-text tasks. Our experiments indicate that text-to-text pre-training in the form of T5 (Raffel et al., 2019) enables simple, end-to-end transformer-based models to outperform pipelined neural architectures tailored for data-to-text generation, as well as alternatives such as BERT and GPT-2. Importantly, T5 pre-training leads to better generalization, as evidenced by large improvements on out-of-domain test sets. We hope our work serves as a useful baseline for future research, as transfer learning becomes ever more prevalent for data-to-text tasks.

DaMata: A Robot-Journalist Covering the Brazilian Amazon Deforestation
André Luiz Rosa Teixeira | João Campos | Rossana Cunha | Thiago Castro Ferreira | Adriana Pagano | Fabio Cozman

This demo paper introduces DaMata, a robot-journalist covering deforestation in the Brazilian Amazon. The robot-journalist is based on a pipeline architecture of Natural Language Generation, which yields multilingual daily and monthly reports based on the public data provided by DETER, a real-time deforestation satellite monitor developed and maintained by the Brazilian National Institute for Space Research (INPE). DaMata automatically generates reports in Brazilian Portuguese and English and publishes them on the Twitter platform. Corpus and code are publicly available.

Generating Quantified Referring Expressions through Attention-Driven Incremental Perception
Gordon Briggs

We model the production of quantified referring expressions (QREs) that identify collections of visual items. A previous approach, called Perceptual Cost Pruning, modeled human QRE production using a preference-based referring expression generation algorithm, first removing facts from the input knowledge base based on a model of perceptual cost. In this paper, we present an alternative model that incrementally constructs a symbolic knowledge base through simulating human visual attention/perception from raw images. We demonstrate that this model produces the same output as Perceptual Cost Pruning. We argue that this is a more extensible approach and a step toward developing a wider range of process-level models of human visual description.

Rich Syntactic and Semantic Information Helps Unsupervised Text Style Transfer
Hongyu Gong | Linfeng Song | Suma Bhat

Text style transfer aims to change an input sentence to an output sentence by changing its text style while preserving the content. Previous efforts on unsupervised text style transfer only use the surface features of words and sentences. As a result, the transferred sentences may either have inaccurate or missing information compared to the inputs. We address this issue by explicitly enriching the inputs via syntactic and semantic structures, from which richer features are then extracted to better capture the original information. Experiments on two text-style-transfer tasks show that our approach improves the content preservation of a strong unsupervised baseline model thereby demonstrating improved transfer performance.

PARENTing via Model-Agnostic Reinforcement Learning to Correct Pathological Behaviors in Data-to-Text Generation
Clement Rebuffel | Laure Soulier | Geoffrey Scoutheeten | Patrick Gallinari

In language generation models conditioned by structured data, the classical training via maximum likelihood almost always leads models to pick up on dataset divergence (i.e., hallucinations or omissions), and to incorporate them erroneously in their own generations at inference. In this work, we build on top of previous Reinforcement Learning based approaches and show that a model-agnostic framework relying on the recently introduced PARENT metric is efficient at reducing both hallucinations and omissions. Evaluations on the widely used WikiBIO and WebNLG benchmarks demonstrate the effectiveness of this framework compared to state-of-the-art models.

Evaluating Semantic Accuracy of Data-to-Text Generation with Natural Language Inference
Ondřej Dušek | Zdeněk Kasner

A major challenge in evaluating data-to-text (D2T) generation is measuring the semantic accuracy of the generated text, i.e. checking if the output text contains all and only facts supported by the input data. We propose a new metric for evaluating the semantic accuracy of D2T generation based on a neural model pretrained for natural language inference (NLI). We use the NLI model to check textual entailment between the input data and the output text in both directions, allowing us to reveal omissions or hallucinations. Input data are converted to text for NLI using trivial templates. Our experiments on two recent D2T datasets show that our metric can achieve high accuracy in identifying erroneous system outputs.
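
The two-directional entailment check described in this abstract can be sketched in a few lines. This is an illustrative sketch only: `toy_entails` is a crude word-overlap stand-in for a real NLI model, and the templates, stop-word list, and function names are hypothetical rather than taken from the authors' implementation.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical trivial templates, one per predicate.
TEMPLATES: Dict[str, str] = {
    "food": "{subj} serves {obj} food.",
    "area": "{subj} is in the {obj}.",
}
STOP_WORDS: Set[str] = {"and", "is", "in", "the", "it"}


def verbalize(triple: Triple) -> str:
    """Convert one input fact to text so it can be fed to an NLI model."""
    subj, pred, obj = triple
    return TEMPLATES[pred].format(subj=subj, obj=obj)


def content_words(text: str) -> Set[str]:
    return set(text.lower().replace(".", " ").split()) - STOP_WORDS


def toy_entails(premise: str, hypothesis: str) -> bool:
    """Stand-in for an NLI model: hypothesis content words all occur in the premise."""
    return content_words(hypothesis) <= content_words(premise)


def check_accuracy(
    triples: Sequence[Triple],
    output: str,
    entails: Callable[[str, str], bool] = toy_entails,
) -> Dict[str, object]:
    facts: List[str] = [verbalize(t) for t in triples]
    # Direction 1 (text => each fact): a fact the text does not entail is an omission.
    omitted = [f for f in facts if not entails(output, f)]
    # Direction 2 (all facts => text): text the data does not entail is a hallucination.
    hallucination = not entails(" ".join(facts), output)
    return {"omitted": omitted, "hallucination": hallucination,
            "ok": not omitted and not hallucination}


data = [("Blue Spice", "food", "Italian"), ("Blue Spice", "area", "riverside")]
print(check_accuracy(data, "Blue Spice serves Italian food."))
```

Passing the entailment function as a parameter keeps the two-direction logic separate from the model, so a real NLI classifier could be substituted for the toy check without changing `check_accuracy`.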

Chart-to-Text: Generating Natural Language Descriptions for Charts by Adapting the Transformer Model
Jason Obeid | Enamul Hoque

Information visualizations such as bar charts and line charts are very popular for exploring data and communicating insights. Interpreting and making sense of such visualizations can be challenging for some people, such as those who are visually impaired or have low visualization literacy. In this work, we introduce a new dataset and present a neural model for automatically generating natural language summaries for charts. The generated summaries provide an interpretation of the chart and convey the key insights found within that chart. Our neural model is developed by extending the state-of-the-art model for the data-to-text generation task, which utilizes a transformer-based encoder-decoder architecture. We found that our approach outperforms the base model on a content selection metric by a wide margin (55.42% vs. 8.49%) and generates more informative, concise, and coherent summaries.

Market Comment Generation from Data with Noisy Alignments
Yumi Hamazono | Yui Uehara | Hiroshi Noji | Yusuke Miyao | Hiroya Takamura | Ichiro Kobayashi

End-to-end models for data-to-text learn the mapping between data and text from the aligned pairs in the dataset. However, these alignments are not always obtained reliably, especially for time-series data, where real-time comments are made on a situation and the comment delivery time may lag behind the actual event time. To handle this issue of possible noisy alignments in the dataset, we propose a neural network model with multi-timestep data and a copy mechanism, which allows the models to learn the correspondences between data and text from the dataset with noisier alignments. We focus on generating market comments in Japanese that are delivered each time an event occurs in the market. The core idea of our approach is to utilize multi-timestep data, which is not only the latest market price data when the comment is delivered, but also the data obtained at several timesteps earlier. On top of this, we employ a copy mechanism that is suitable for referring to the content of data records in the market price data. We confirm the superiority of our proposal by two evaluation metrics and show the accuracy improvement of the sentence generation using the time series data by our proposed method.

A Gold Standard Methodology for Evaluating Accuracy in Data-To-Text Systems
Craig Thomson | Ehud Reiter

Most Natural Language Generation systems need to produce accurate texts. We propose a methodology for high-quality human evaluation of the accuracy of generated texts, which is intended to serve as a gold-standard for accuracy evaluations of data-to-text systems. We use our methodology to evaluate the accuracy of computer generated basketball summaries. We then show how our gold standard evaluation can be used to validate automated metrics.

Twenty Years of Confusion in Human Evaluation: NLG Needs Evaluation Sheets and Standardised Definitions
David M. Howcroft | Anya Belz | Miruna-Adriana Clinciu | Dimitra Gkatzia | Sadid A. Hasan | Saad Mahamood | Simon Mille | Emiel van Miltenburg | Sashank Santhanam | Verena Rieser

Human assessment remains the most trusted form of evaluation in NLG, but highly diverse approaches and a proliferation of different quality criteria used by researchers make it difficult to compare results and draw conclusions across papers, with adverse implications for meta-evaluation and reproducibility. In this paper, we present (i) our dataset of 165 NLG papers with human evaluations, (ii) the annotation scheme we developed to label the papers for different aspects of evaluations, (iii) quantitative analyses of the annotations, and (iv) a set of recommendations for improving standards in evaluation reporting. We use the annotations as a basis for examining information included in evaluation reports, and levels of consistency in approaches, experimental design and terminology, focusing in particular on the 200+ different terms that have been used for evaluated aspects of quality. We conclude that due to a pervasive lack of clarity in reports and extreme diversity in approaches, human evaluation in NLG presents as extremely confused in 2020, and that the field is in urgent need of standard methods and terminology.

Disentangling the Properties of Human Evaluation Methods: A Classification System to Support Comparability, Meta-Evaluation and Reproducibility Testing
Anya Belz | Simon Mille | David M. Howcroft

Current standards for designing and reporting human evaluations in NLP mean it is generally unclear which evaluations are comparable and can be expected to yield similar results when applied to the same system outputs. This has serious implications for reproducibility testing and meta-evaluation, in particular given that human evaluation is considered the gold standard against which the trustworthiness of automatic metrics is gauged. Using examples from NLG, we propose a classification system for evaluations based on disentangling (i) what is being evaluated (which aspect of quality), and (ii) how it is evaluated in specific (a) evaluation modes and (b) experimental designs. We show that this approach provides a basis for determining comparability, hence for comparison of evaluations across papers, meta-evaluation experiments, and reproducibility testing.

Stable Style Transformer: Delete and Generate Approach with Encoder-Decoder for Text Style Transfer
Joosung Lee

Text style transfer is the task of generating a sentence that preserves the content of the input sentence while transferring its style. Most existing studies work on non-parallel datasets because parallel datasets are limited and hard to construct. In this work, we introduce a method that follows two stages in non-parallel datasets. The first stage is to delete attribute markers of a sentence directly through a classifier. The second stage is to generate a transferred sentence by combining the content tokens and the target style. We experiment on two benchmark datasets and evaluate context, style, fluency, and semantics. It is difficult to select the best system using only these automatic metrics, but it is possible to select stable systems. We consider only robust systems in all automatic evaluation metrics to be the minimum conditions that can be used in real applications. Many previous systems are difficult to use in certain situations because performance is significantly lower in several evaluation metrics. However, our system is stable in all automatic evaluation metrics and has results comparable to other models. Also, we compare the performance results of our system and the unstable system through human evaluation.

Listener’s Social Identity Matters in Personalised Response Generation
Guanyi Chen | Yinhe Zheng | Yupei Du

Personalised response generation enables generating human-like responses by means of assigning the generator a social identity. However, pragmatics theory suggests that human beings adjust the way of speaking based on not only who they are but also whom they are talking to. In other words, when modelling personalised dialogues, it might be favourable if we also take the listener’s social identity into consideration. To validate this idea, we use gender as a typical example of a social variable to investigate how the listener’s identity influences the language used in Chinese dialogues on social media. Also, we build personalised generators. The experiment results demonstrate that the listener’s identity indeed matters in the language use of responses and that the response generator can capture such differences in language use. More interestingly, by additionally modelling the listener’s identity, the personalised response generator performs better in its own identity.

Understanding and Explicitly Measuring Linguistic and Stylistic Properties of Deception via Generation and Translation
Emily Saldanha | Aparna Garimella | Svitlana Volkova

Massive digital disinformation is one of the main risks of modern society. Hundreds of models and linguistic analyses have been done to compare and contrast misleading and credible content online. However, most models do not remove the confounding factor of a topic or narrative when training, so the resulting models learn a clear topical separation for misleading versus credible content. We study the feasibility of using two strategies to disentangle the topic bias from the models to understand and explicitly measure linguistic and stylistic properties of content from misleading versus credible content. First, we develop conditional generative models to create news content that is characteristic of different credibility levels. We perform multi-dimensional evaluation of model performance on mimicking both the style and linguistic differences that distinguish news of different credibility using machine translation metrics and classification models. We show that even though generative models are able to imitate both the style and language of the original content, additional conditioning on both the news category and the topic leads to reduced performance. In a second approach, we perform deception style “transfer” by translating deceptive content into the style of credible content and vice versa. Extending earlier studies, we demonstrate that, when conditioned on a topic, deceptive content is shorter, less readable, more biased, and more subjective than credible content, and transferring the style from deceptive to credible content is more challenging than the opposite direction.

Shared Task on Evaluating Accuracy
Ehud Reiter | Craig Thomson

We propose a shared task on methodologies and algorithms for evaluating the accuracy of generated texts, specifically summaries of basketball games produced from basketball box score and other game data. We welcome submissions based on protocols for human evaluation, automatic metrics, as well as combinations of human evaluations and metrics.

ReproGen: Proposal for a Shared Task on Reproducibility of Human Evaluations in NLG
Anya Belz | Shubham Agarwal | Anastasia Shimorina | Ehud Reiter

Across NLP, a growing body of work is looking at the issue of reproducibility. However, replicability of human evaluation experiments and reproducibility of their results are currently under-addressed, and this is of particular concern for NLG where human evaluations are the norm. This paper outlines our ideas for a shared task on reproducibility of human evaluations in NLG which aims (i) to shed light on the extent to which past NLG evaluations are replicable and reproducible, and (ii) to draw conclusions regarding how evaluations can be designed and reported to increase replicability and reproducibility. If the task is run over several years, we hope to be able to document an overall increase in levels of replicability and reproducibility over time.

Task Proposal: Abstractive Snippet Generation for Web Pages
Shahbaz Syed | Wei-Fan Chen | Matthias Hagen | Benno Stein | Henning Wachsmuth | Martin Potthast

We propose a shared task on abstractive snippet generation for web pages, a novel task of generating query-biased abstractive summaries for documents that are to be shown on a search results page. Conventional snippets are extractive in nature, which recently gave rise to copyright claims from news publishers as well as a new copyright legislation being passed in the European Union, limiting the fair use of web page contents for snippets. At the same time, abstractive summarization has matured considerably in recent years, potentially allowing for more personalization of snippets in the future. Taken together, these facts render further research into generating abstractive snippets both timely and promising.

BERT-Based Simplification of Japanese Sentence-Ending Predicates in Descriptive Text
Taichi Kato | Rei Miyata | Satoshi Sato

Japanese sentence-ending predicates intricately combine content words and functional elements, such as aspect, modality, and honorifics; this can often hinder the understanding of language learners and children. Conventional lexical simplification methods, which replace difficult target words with simpler synonyms acquired from lexical resources in a word-by-word manner, are not always suitable for the simplification of such Japanese predicates. Given this situation, we propose a BERT-based simplification method, the core feature of which is the high ability to substitute the whole predicates with simple ones while maintaining their core meanings in the context by utilizing pre-trained masked language models. Experimental results showed that our proposed methods consistently outperformed the conventional thesaurus-based method by a wide margin. Furthermore, we investigated in detail the effectiveness of the average token embedding and dropout, and the remaining errors of our BERT-based methods.

Amplifying the Range of News Stories with Creativity: Methods and their Evaluation, in Portuguese
Rui Mendes | Hugo Gonçalo Oliveira

Headlines are key for attracting people to a story, but writing appealing headlines requires time and talent. This work aims to automate the production of creative short texts (e.g., news headlines) for an input context (e.g., existing headlines), thus amplifying its range. Well-known expressions (e.g., proverbs, movie titles), which typically include word-play and resort to figurative language, are used as a starting point. Given an input text, they can be recommended by exploiting Semantic Textual Similarity (STS) techniques, or adapted towards higher relatedness. For the latter, three methods that exploit static word embeddings are proposed. Experimentation in Portuguese led to some conclusions, based on human opinions: STS methods that look exclusively at the surface text recommend more related expressions; resulting expressions are somewhat related to the input, but adaptation leads to higher relatedness and novelty; humour can be an indirect consequence, but most outputs are not funny.

Lessons from Computational Modelling of Reference Production in Mandarin and English
Guanyi Chen | Kees van Deemter

Referring expression generation (REG) algorithms offer computational models of the production of referring expressions. In earlier work, a corpus of referring expressions (REs) in Mandarin was introduced. In the present paper, we annotate this corpus, evaluate classic REG algorithms on it, and compare the results with earlier results on the evaluation of REG for English referring expressions. Next, we offer an in-depth analysis of the corpus, focusing on issues that arise from the grammar of Mandarin. We discuss shortcomings of previous REG evaluations that came to light during our investigation and we highlight some surprising results. Perhaps most strikingly, we found a much higher proportion of under-specified expressions than previous studies had suggested, not just in Mandarin but in English as well.

Generating Varied Training Corpora in Runyankore Using a Combined Semantic and Syntactic, Pattern-Grammar-based Approach
Joan Byamugisha

Machine learning algorithms have been applied to achieve high levels of accuracy in tasks associated with the processing of natural language. However, these algorithms require large amounts of training data in order to perform efficiently. Since most Bantu languages lack the required training corpora because they are computationally under-resourced, we investigated how to generate a large varied training corpus in Runyankore, a Bantu language indigenous to Uganda. We found the use of a combined semantic and syntactic, pattern and grammar-based approach to be applicable to this purpose, and used it to generate one million sentences, both labelled and unlabelled, which can be applied as training data for machine learning algorithms. The generated text was evaluated in two ways: (1) assessing the semantics encoded in word embeddings obtained from the generated text, which showed correct word similarity; and (2) applying the labelled data to tasks such as sentiment analysis, which achieved satisfactory levels of accuracy.

Schema-Guided Natural Language Generation
Yuheng Du | Shereen Oraby | Vittorio Perera | Minmin Shen | Anjali Narayan-Chen | Tagyoung Chung | Anushree Venkatesh | Dilek Hakkani-Tur

Neural network based approaches to data-to-text natural language generation (NLG) have gained popularity in recent years, with the goal of generating a natural language prompt that accurately realizes an input meaning representation. To facilitate the training of neural network models, researchers created large datasets of paired utterances and their meaning representations. However, the creation of such datasets is an arduous task and they mostly consist of simple meaning representations composed of slot and value tokens to be realized. These representations do not include any contextual information that an NLG system can use when trying to generalize, such as domain information and descriptions of slots and values. In this paper, we present the novel task of Schema-Guided Natural Language Generation (SG-NLG). Here, the goal is still to generate a natural language prompt, but in SG-NLG, the input MRs are paired with rich schemata providing contextual information. To generate a dataset for SG-NLG we re-purpose an existing dataset for another task: dialog state tracking, which includes a large and rich schema spanning multiple different attributes, including information about the domain, user intent, and slot descriptions. We train different state-of-the-art models for neural natural language generation on this dataset and show that in many cases, including rich schema information allows our models to produce higher quality outputs both in terms of semantics and diversity. We also conduct experiments comparing model performance on seen versus unseen domains, and present a human evaluation demonstrating high ratings for overall output quality.

OMEGA : A probabilistic approach to referring expression generation in a virtual environment
Maurice Langner

In recent years, referring expression generation algorithms have been inspired by game theory and probability theory. In this paper, an algorithm is designed for the generation of referring expressions (REG) that is based on both models by integrating maximization of utilities into the content determination process. It implements cognitive models for assessing visual salience of objects and additional features. In order to evaluate the algorithm properly and validate the applicability of existing models and evaluative information criteria, both production and comprehension studies are conducted using a complex domain of objects, providing new directions of approaching the evaluation of REG algorithms.

Neural NLG for Methodius: From RST Meaning Representations to Texts
Symon Stevens-Guille | Aleksandre Maskharashvili | Amy Isard | Xintong Li | Michael White

While classic NLG systems typically made use of hierarchically structured content plans that included discourse relations as central components, more recent neural approaches have mostly mapped simple, flat inputs to texts without representing discourse relations explicitly. In this paper, we investigate whether it is beneficial to include discourse relations in the input to neural data-to-text generators for texts where discourse relations play an important role. To do so, we reimplement the sentence planning and realization components of a classic NLG system, Methodius, using LSTM sequence-to-sequence (seq2seq) models. We find that although seq2seq models can learn to generate fluent and grammatical texts remarkably well with sufficiently representative Methodius training data, they cannot learn to correctly express Methodius’s similarity and contrast comparisons unless the corresponding RST relations are included in the inputs. Additionally, we experiment with using self-training and reverse model reranking to better handle train/test data mismatches, and find that while these methods help reduce content errors, it remains essential to include discourse relations in the input to obtain optimal performance.

From “Before” to “After”: Generating Natural Language Instructions from Image Pairs in a Simple Visual Domain
Robin Rojowiec | Jana Götze | Philipp Sadler | Henrik Voigt | Sina Zarrieß | David Schlangen

While certain types of instructions can be compactly expressed via images, there are situations where one might want to verbalise them, for example when directing someone. We investigate the task of Instruction Generation from Before/After Image Pairs which is to derive from images an instruction for effecting the implied change. For this, we make use of prior work on instruction following in a visual environment. We take an existing dataset, the BLOCKS data collected by Bisk et al. (2016), and investigate whether it is suitable for training an instruction generator as well. We find that it is, and investigate several simple baselines, taking these from the related task of image captioning. Through a series of experiments that simplify the task (by making image processing easier or completely side-stepping it; and by creating template-based targeted instructions), we investigate areas for improvement. We find that captioning models get some way towards solving the task, but have some difficulty with it, and future improvements must lie in the way the change is detected in the instruction.

What BERT Sees: Cross-Modal Transfer for Visual Question Generation
Thomas Scialom | Patrick Bordes | Paul-Alexis Dray | Jacopo Staiano | Patrick Gallinari

Pre-trained language models have recently contributed to significant advances in NLP tasks. Multi-modal versions of BERT have since been developed, using heavy pre-training relying on vast corpora of aligned textual and image data, primarily applied to classification tasks such as VQA. In this paper, we are interested in evaluating the visual capabilities of BERT out-of-the-box, by avoiding pre-training on supplementary data. We choose to study Visual Question Generation, a task of great interest for grounded dialog, which makes it possible to study the impact of each modality (as input can be visual and/or textual). Moreover, the generation aspect of the task requires an adaptation since BERT is primarily designed as an encoder. We introduce BERT-gen, a BERT-based architecture for text generation, able to leverage either mono- or multi-modal representations. The results reported under different configurations indicate an innate capacity for BERT-gen to adapt to multi-modal data and text generation, even with few data available, avoiding expensive pre-training. The proposed model obtains substantial improvements over the state-of-the-art on two established VQG datasets.

When an Image Tells a Story: The Role of Visual and Semantic Information for Generating Paragraph Descriptions
Nikolai Ilinykh | Simon Dobnik

Generating multi-sentence image descriptions is a challenging task, which requires a good model to produce coherent and accurate paragraphs, describing salient objects in the image. We argue that multiple sources of information are beneficial when describing visual scenes with long sequences. These include (i) perceptual information and (ii) semantic (language) information about how to describe what is in the image. We also compare the effects of using two different pooling mechanisms on either a single modality or their combination. We demonstrate that the model which utilises both visual and language inputs can be used to generate accurate and diverse paragraphs when combined with a particular pooling mechanism. The results of our automatic and human evaluation show that learning to embed semantic information along with visual stimuli into the paragraph generation model is not trivial, raising a variety of proposals for future experiments.

Transformer based Natural Language Generation for Question-Answering
Imen Akermi | Johannes Heinecke | Frédéric Herledan

This paper explores Natural Language Generation within the context of the Question-Answering task. Existing works addressing this task have focused only on generating a short answer or a long text span that contains the answer, while reasoning over a Web page or processing structured data. The length of such answers is usually not appropriate, as the answer tends to be perceived as too brief or too long to be read out loud by an intelligent assistant. In this work, we aim at generating a concise answer for a given question using an unsupervised approach that does not require annotated data. Tested over English and French datasets, the proposed approach shows very promising results.

Rapformer: Conditional Rap Lyrics Generation with Denoising Autoencoders
Nikola I. Nikolov | Eric Malmi | Curtis Northcutt | Loreto Parisi

The ability to combine symbols to generate language is a defining characteristic of human intelligence, particularly in the context of artistic story-telling through lyrics. We develop a method for synthesizing a rap verse based on the content of any text (e.g., a news article), or for augmenting pre-existing rap lyrics. Our method, called Rapformer, is based on training a Transformer-based denoising autoencoder to reconstruct rap lyrics from content words extracted from the lyrics, trying to preserve the essential meaning, while matching the target style. Rapformer features a novel BERT-based paraphrasing scheme for rhyme enhancement which increases the average rhyme density of output lyrics by 10%. Experimental results on three diverse input domains show that Rapformer is capable of generating technically fluent verses that offer a good trade-off between content preservation and style transfer. Furthermore, a Turing-test-like experiment reveals that Rapformer fools human lyrics experts 25% of the time.

Reducing Non-Normative Text Generation from Language Models
Xiangyu Peng | Siyan Li | Spencer Frazier | Mark Riedl

Large-scale, transformer-based language models such as GPT-2 are pretrained on diverse corpora scraped from the internet. Consequently, they are prone to generating non-normative text (i.e. in violation of social norms). We introduce a technique for fine-tuning GPT-2, using a policy gradient reinforcement learning technique and a normative text classifier to produce reward and punishment values. We evaluate our technique on five data sets using automated and human participant experiments. The normative text classifier is 81-90% accurate when compared to gold-standard human judgements of normative and non-normative generated text. Our normative fine-tuning technique is able to reduce non-normative text by 27-61%, depending on the data set.

ReviewRobot: Explainable Paper Review Generation based on Knowledge Synthesis
Qingyun Wang | Qi Zeng | Lifu Huang | Kevin Knight | Heng Ji | Nazneen Fatema Rajani

To assist the human review process, we build a novel ReviewRobot to automatically assign a review score and write comments for multiple categories such as novelty and meaningful comparison. A good review needs to be knowledgeable, namely that the comments should be constructive and informative to help improve the paper; and explainable by providing detailed evidence. ReviewRobot achieves these goals via three steps: (1) We perform domain-specific Information Extraction to construct a knowledge graph (KG) from the target paper under review, a related work KG from the papers cited by the target paper, and a background KG from a large collection of previous papers in the domain. (2) By comparing these three KGs, we predict a review score and detailed structured knowledge as evidence for each review category. (3) We carefully select and generalize human review sentences into templates, and apply these templates to transform the review scores and evidence into natural language comments. Experimental results show that our review score predictor reaches 71.4%-100% accuracy. Human assessment by domain experts shows that 41.7%-70.5% of the comments generated by ReviewRobot are valid and constructive, and better than human-written ones 20% of the time. Thus, ReviewRobot can serve as an assistant for paper reviewers, program chairs and authors.

Gradations of Error Severity in Automatic Image Descriptions
Emiel van Miltenburg | Wei-Ting Lu | Emiel Krahmer | Albert Gatt | Guanyi Chen | Lin Li | Kees van Deemter

Earlier research has shown that evaluation metrics based on textual similarity (e.g., BLEU, CIDEr, Meteor) do not correlate well with human evaluation scores for automatically generated text. We carried out an experiment with Chinese speakers, where we systematically manipulated image descriptions to contain different kinds of errors. Because our manipulated descriptions form minimal pairs with the reference descriptions, we are able to assess the impact of different kinds of errors on the perceived quality of the descriptions. Our results show that different kinds of errors elicit significantly different evaluation scores, even though all erroneous descriptions differ in only one character from the reference descriptions. Evaluation metrics based solely on textual similarity are unable to capture these differences, which (at least partially) explains their poor correlation with human judgments. Our work provides the foundations for future work, where we aim to understand why different errors are seen as more or less severe.

Policy-Driven Neural Response Generation for Knowledge-Grounded Dialog Systems
Behnam Hedayatnia | Karthik Gopalakrishnan | Seokhwan Kim | Yang Liu | Mihail Eric | Dilek Hakkani-Tur

Open-domain dialog systems aim to generate relevant, informative and engaging responses. In this paper, we propose using a dialog policy to plan the content and style of target, open-domain responses in the form of an action plan, which includes knowledge sentences related to the dialog context, targeted dialog acts, topic information, etc. For training, the attributes within the action plan are obtained by automatically annotating the publicly released Topical-Chat dataset. We condition neural response generators on the action plan, which is then realized as target utterances at the turn and sentence levels. We also investigate different dialog policy models to predict an action plan given the dialog context. Through automated and human evaluation, we measure the appropriateness of the generated responses and check if the generation models indeed learn to realize the given action plans. We demonstrate that a basic dialog policy that operates at the sentence level generates better responses in comparison to turn-level generation as well as baseline models with no action plan. Additionally, the basic dialog policy has the added benefit of controllability.


Proceedings of the 1st Workshop on Evaluating NLG Evaluation

Proceedings of the 1st Workshop on Evaluating NLG Evaluation
Shubham Agarwal | Ondřej Dušek | Sebastian Gehrmann | Dimitra Gkatzia | Ioannis Konstas | Emiel Van Miltenburg | Sashank Santhanam

A proof of concept on triangular test evaluation for Natural Language Generation
Javier González Corbelle | José María Alonso Moral | Alberto Bugarín Diz

The evaluation of Natural Language Generation (NLG) systems has recently aroused much interest in the research community, since it should address several challenging aspects, such as readability of the generated texts, adequacy to the user within a particular context and moment, and linguistic quality-related issues (e.g., correctness, coherence, understandability), among others. In this paper, we propose a novel technique for evaluating NLG systems that is inspired by the triangular test used in the field of sensory analysis. This technique allows us to compare two texts generated by different subjects and to i) determine whether statistically significant differences are detected between them when evaluated by humans and ii) quantify to what extent the number of evaluators plays an important role in the sensitivity of the results. As a proof of concept, we apply this evaluation technique in a real use case in the field of meteorology, showing the advantages and disadvantages of our proposal.
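
As a rough illustration of the statistics behind a triangular test (a sketch under stated assumptions, not necessarily the procedure used in the paper): in the classic sensory-analysis setup, each evaluator is shown three items, two from one source and one from the other, and must pick the odd one out, so pure guessing succeeds with probability 1/3. A one-sided exact binomial test then indicates whether the observed number of correct identifications is higher than chance would explain. The function name and the example counts below are hypothetical.

```python
from math import comb


def triangle_test_p_value(n_correct: int, n_trials: int, p_chance: float = 1.0 / 3.0) -> float:
    """One-sided exact binomial p-value, P(X >= n_correct), assuming every
    evaluator guesses the odd item with probability p_chance."""
    if not 0 <= n_correct <= n_trials:
        raise ValueError("n_correct must be between 0 and n_trials")
    # Sum the binomial probabilities of n_correct or more successes.
    return sum(
        comb(n_trials, k) * p_chance**k * (1.0 - p_chance) ** (n_trials - k)
        for k in range(n_correct, n_trials + 1)
    )


# Hypothetical example: 30 evaluators, 17 of whom identify the odd text.
p = triangle_test_p_value(17, 30)
print(f"p-value = {p:.4f}")
```

A small p-value suggests that evaluators can tell the two kinds of text apart; rerunning the function with different values of `n_trials` gives a feel for how the number of evaluators affects sensitivity.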

“This is a Problem, Don’t You Agree?” Framing and Bias in Human Evaluation for Natural Language Generation
Stephanie Schoch | Diyi Yang | Yangfeng Ji

Despite recent efforts reviewing current human evaluation practices for natural language generation (NLG) research, the lack of reported question wording and potential for framing effects or cognitive biases influencing results has been widely overlooked. In this opinion paper, we detail three possible framing effects and cognitive biases that could be imposed on human evaluation in NLG. Based on this, we make a call for increased transparency for human evaluation in NLG and propose the concept of human evaluation statements. We make several recommendations for design details to report that could potentially influence results, such as question wording, and suggest that reporting pertinent design details can help increase comparability across studies as well as reproducibility of results.

Evaluation rules! On the use of grammars and rule-based systems for NLG evaluation
Emiel van Miltenburg | Chris van der Lee | Thiago Castro-Ferreira | Emiel Krahmer

NLG researchers often use uncontrolled corpora to train and evaluate their systems, using textual similarity metrics, such as BLEU. This position paper argues in favour of two alternative evaluation strategies, using grammars or rule-based systems. These strategies are particularly useful to identify the strengths and weaknesses of different systems. We contrast our proposals with the (extended) WebNLG dataset, which is revealed to have a skewed distribution of predicates. We predict that this distribution affects the quality of the predictions for systems trained on this data. However, this hypothesis can only be thoroughly tested (without any confounds) once we are able to systematically manipulate the skewness of the data, using a rule-based approach.

NUBIA: NeUral Based Interchangeability Assessor for Text Generation
Hassan Kane | Muhammed Yusuf Kocyigit | Ali Abdalla | Pelkins Ajanoh | Mohamed Coulibali

We present NUBIA, a methodology to build automatic evaluation metrics for text generation using only machine learning models as core components. A typical NUBIA model is composed of three modules: a neural feature extractor, an aggregator and a calibrator. We demonstrate an implementation of NUBIA showing competitive performance with state-of-the-art metrics used to evaluate machine translation and state-of-the-art results for image caption quality evaluation. In addition to strong performance, NUBIA models have the advantage of being modular and improve in synergy with advances in text generation models.

On the interaction of automatic evaluation and task framing in headline style transfer
Lorenzo De Mattei | Michele Cafagna | Huiyuan Lai | Felice Dell’Orletta | Malvina Nissim | Albert Gatt

An ongoing debate in the NLG community concerns the best way to evaluate systems, with human evaluation often being considered the most reliable method, compared to corpus-based metrics. However, tasks involving subtle textual differences, such as style transfer, tend to be hard for humans to perform. In this paper, we propose an evaluation method for this task based on purposely-trained classifiers, showing that it better reflects system differences than traditional metrics such as BLEU.


2nd Workshop on Interactive Natural Language Technology for Explainable Artificial Intelligence

2nd Workshop on Interactive Natural Language Technology for Explainable Artificial Intelligence
Jose M. Alonso | Alejandro Catala

Automatically explaining health information
Emiel Krahmer

Modern AI systems automatically learn from data using sophisticated statistical models. Explaining how these systems work and how they make their predictions therefore increasingly involves producing descriptions of how different probabilities are weighted and which uncertainties underlie these numbers. But what is the best way to (automatically) present such probabilistic explanations, do people actually understand them, and what is the potential impact of such information on people’s wellbeing? In this talk, I address these questions in the context of systems that automatically generate personalised health information. The emergence of large national health registries, such as the Dutch cancer registry, now makes it possible to automatically generate descriptions of treatment options for new cancer patients based on data of comparable patients, including health and quality of life predictions following different treatments. I describe a series of studies, in which our team has investigated to what extent this information is currently provided to people, and under which conditions people actually want to have access to this kind of data-driven explanation. Additionally, we have studied whether there are different profiles in information needs, and what the best way is to provide probabilistic information and the associated uncertainties to people.

Bias in AI-systems: A multi-step approach
Eirini Ntoutsi

Algorithmic-based decision making powered via AI and (big) data has already penetrated into almost all spheres of human life, from content recommendation and healthcare to predictive policing and autonomous driving, deeply affecting everyone, anywhere, anytime. While technology allows previously unthinkable optimizations in the automation of expensive human decision making, the risks that the technology can pose are also high, leading to an ever increasing public concern about the impact of the technology on our lives. The area of responsible AI has recently emerged in an attempt to put humans at the center of AI-based systems by considering aspects such as fairness, reliability and privacy of decision-making systems. In this talk, we will focus on the fairness aspect. We will start with understanding the many sources of bias and how biases can enter at each step of the learning process and even get propagated/amplified from previous steps. We will continue with methods for mitigating bias, which typically focus on some step of the pipeline (data, algorithms or results), and why it is important to target bias in each step and collectively, in the whole (machine) learning pipeline. We will conclude this talk by discussing accountability issues in connection to bias and in particular, proactive consideration via bias-aware data collection, processing and algorithmic selection and retroactive consideration via explanations.

Content Selection for Explanation Requests in Customer-Care Domain
Luca Anselma | Mirko Di Lascio | Dario Mana | Alessandro Mazzei | Manuela Sanguinetti

This paper describes a content selection module for the generation of explanations in a dialogue system designed for the customer care domain. First we describe the construction of a corpus of dialogues containing explanation requests from customers to a virtual agent of a telco, and second we study and formalize the importance of a specific information content for the generated message. In particular, we adapt the notions of importance and relevance in the case of schematic knowledge bases.

ExTRA: Explainable Therapy-Related Annotations
Mat Rawsthorne | Tahseen Jilani | Jacob Andrews | Yunfei Long | Jeremie Clos | Samuel Malins | Daniel Hunt

In this paper we report progress on a novel explainable artificial intelligence (XAI) initiative applying Natural Language Processing (NLP) with elements of codesign to develop a text classifier for application in psychotherapy training. The task is to produce a tool that will help therapists review their sessions by automatically labelling transcript text with levels of interaction for patient activation in known psychological processes, using XAI to increase their trust in the model’s suggestions and client trajectory predictions. After pre-processing of the language features extracted from professionally annotated therapy session transcripts, we apply a supervised machine learning approach (CHAID) to classify interaction labels (negative, neutral, positive). Weighted samples are used to overcome class-imbalanced data. The results show this initial model can make useful distinctions among the three labels of patient activation with 74% accuracy and provide insight into its reasoning. This ongoing project will additionally evaluate which XAI approaches can be used to increase the transparency of the tool to end users, exploring whether direct involvement of stakeholders improves usability of the XAI interface and therefore trust in the solution.

The Natural Language Pipeline, Neural Text Generation and Explainability
Juliette Faille | Albert Gatt | Claire Gardent

End-to-end encoder-decoder approaches to data-to-text generation are often black boxes whose predictions are difficult to explain. Breaking up the end-to-end model into sub-modules is a natural way to address this problem. The traditional pre-neural Natural Language Generation (NLG) pipeline provides a framework for breaking up the end-to-end encoder-decoder. We survey recent papers that integrate traditional NLG submodules in neural approaches and analyse their explainability. Our survey is a first step towards building explainable neural NLG models.

Towards Harnessing Natural Language Generation to Explain Black-box Models
Ettore Mariotti | Jose M. Alonso | Albert Gatt

The opaque nature of many machine learning techniques prevents the wide adoption of powerful information processing tools for high-stakes scenarios. The emerging field of eXplainable Artificial Intelligence (XAI) aims at providing justifications for automatic decision-making systems in order to ensure reliability and trustworthiness for the users. For achieving this vision, we emphasize the importance of a natural language textual modality as a key component for a future intelligent interactive agent. We outline the challenges of XAI and review a set of publications that work in this direction.

Explaining Bayesian Networks in Natural Language: State of the Art and Challenges
Conor Hennessy | Alberto Bugarín | Ehud Reiter

In order to increase trust in the usage of Bayesian Networks and to cement their role as a model which can aid in critical decision making, the challenge of explainability must be faced. Previous attempts at explaining Bayesian Networks have largely focused on graphical or visual aids. In this paper we aim to highlight the importance of a natural language approach to explanation and to discuss some of the previous and state-of-the-art attempts at the textual explanation of Bayesian Networks. We outline several challenges that remain to be addressed in the generation and validation of natural language explanations of Bayesian Networks. This can serve as a reference for future work on natural language explanations of Bayesian Networks.

Explaining data using causal Bayesian networks
Jaime Sevilla

I introduce Causal Bayesian Networks as a formalism for representing and explaining probabilistic causal relations, review the state of the art on learning Causal Bayesian Networks and suggest and illustrate a research avenue for studying pairwise identification of causal relations inspired by graphical causality criteria.

Towards Generating Effective Explanations of Logical Formulas: Challenges and Strategies
Alexandra Mayn | Kees van Deemter

While the problem of natural language generation from logical formulas has a long tradition, thus far little attention has been paid to ensuring that the generated explanations are optimally effective for the user. We discuss issues related to deciding what such output should look like and strategies for addressing those issues. We stress the importance of informing generation of NL explanations of logical formulas through reader studies and findings on the comprehension of logic from Pragmatics and Cognitive Science. We then illustrate the discussed issues and potential ways of addressing them using a simple demo system’s output generated from a propositional logic formula.

Argumentation Theoretical Frameworks for Explainable Artificial Intelligence
Martijn Demollin | Qurat-Ul-Ain Shaheen | Katarzyna Budzynska | Carles Sierra

This paper discusses four major argumentation theoretical frameworks with respect to their use in support of explainable artificial intelligence (XAI). We consider these frameworks as useful tools for both system-centred and user-centred XAI. The former is concerned with the generation of explanations for decisions taken by AI systems, while the latter is concerned with the way explanations are given to users and received by them.

Toward Natural Language Mitigation Strategies for Cognitive Biases in Recommender Systems
Alisa Rieger | Mariët Theune | Nava Tintarev

Cognitive biases in the context of consuming online information filtered by recommender systems may lead to sub-optimal choices. One approach to mitigate such biases is through interface and interaction design. This survey reviews studies focused on cognitive bias mitigation of recommender system users during two processes: 1) item selection and 2) preference elicitation. It highlights a number of promising directions for Natural Language Generation research for mitigating cognitive bias including: the need for personalization, as well as for transparency and control.

When to explain: Identifying explanation triggers in human-agent interaction
Lea Krause | Piek Vossen

With more agents deployed than ever, users need to be able to interact and cooperate with them in an effective and comfortable manner. Explanations have been shown to increase the understanding and trust of a user in human-agent interaction. There have been numerous studies investigating this effect, but they rely on the user explicitly requesting an explanation. We propose a first overview of when an explanation should be triggered and show that there are many instances that would be missed if the agent solely relies on direct questions. For this, we differentiate between direct triggers such as commands or questions and introduce indirect triggers like confusion or uncertainty detection.

Learning from Explanations and Demonstrations: A Pilot Study
Silvia Tulli | Sebastian Wallkötter | Ana Paiva | Francisco S. Melo | Mohamed Chetouani

AI has become prominent in a growing number of systems, and, as a direct consequence, the desire for explainability in such systems has become prominent as well. To build explainable systems, a large portion of existing research uses various kinds of natural language technologies, e.g., text-to-speech mechanisms, or string visualizations. Here, we provide an overview of the challenges associated with natural language explanations by reviewing existing literature. Additionally, we discuss the relationship between explainability and knowledge transfer in reinforcement learning. We argue that explainability methods, in particular methods that model the recipient of an explanation, might help increase sample efficiency. For this, we present a computational approach to optimize the learner’s performance using explanations of another agent and discuss our results in light of effective natural language explanations for humans.

Generating Explanations of Action Failures in a Cognitive Robotic Architecture
Ravenna Thielstrom | Antonio Roque | Meia Chita-Tegmark | Matthias Scheutz

We describe an approach to generating explanations about why robot actions fail, focusing on the considerations of robots that are run by cognitive robotic architectures. We define a set of Failure Types and Explanation Templates, motivating them by the needs and constraints of cognitive architectures that use action scripts and interpretable belief states, and describe content realization and surface realization in this context. We then describe an evaluation that can be extended to further study the effects of varying the explanation templates.


Proceedings of the 3rd International Workshop on Natural Language Generation from the Semantic Web (WebNLG+)

Proceedings of the 3rd International Workshop on Natural Language Generation from the Semantic Web (WebNLG+)
Thiago Castro Ferreira | Claire Gardent | Nikolai Ilinykh | Chris van der Lee | Simon Mille | Diego Moussallem | Anastasia Shimorina

A Case Study of NLG from Multimedia Data Sources: Generating Architectural Landmark Descriptions
Simon Mille | Spyridon Symeonidis | Maria Rousi | Montserrat Marimon Felipe | Klearchos Stavrothanasopoulos | Petros Alvanitopoulos | Roberto Carlini Salguero | Jens Grivolla | Georgios Meditskos | Stefanos Vrochidis | Leo Wanner

In this paper, we present a pipeline system that generates architectural landmark descriptions using textual, visual and structured data. The pipeline comprises five main components: (i) a textual analysis component, which extracts information from Wikipedia pages; (ii) a visual analysis component, which extracts information from copyright-free images; (iii) a retrieval component, which gathers relevant (property, subject, object) triples from DBpedia; (iv) a fusion component, which stores the contents from the different modalities in a Knowledge Base (KB) and resolves the conflicts that stem from using different sources of information; (v) an NLG component, which verbalises the resulting contents of the KB. We show that thanks to the addition of other modalities, we can make the verbalisation of DBpedia triples more relevant and/or inspirational.

OWLSIZ: An isiZulu CNL for structured knowledge validation
Zola Mahlaza | C. Maria Keet

In iterative knowledge elicitation, engineers are expected to be directly involved in validating the already captured knowledge and obtaining new knowledge increments, thus making the process time-consuming. Languages such as English have controlled natural languages that can be repurposed to generate natural language questions from an ontology in order to allow a domain expert to independently validate the contents of an ontology without understanding an ontology authoring language such as OWL. IsiZulu, South Africa’s main L1 language by number of speakers, does not have such a resource; hence, it is not possible to build a verbaliser to generate such questions. Therefore, we propose an isiZulu controlled natural language, called OWL Simplified isiZulu (OWLSIZ), for producing grammatical and fluent questions from an ontology. Human evaluation of the generated questions showed that participants’ judgements agree that most (83%) questions are positive for grammaticality or understandability.

A General Benchmarking Framework for Text Generation
Diego Moussallem | Paramjot Kaur | Thiago Ferreira | Chris van der Lee | Anastasia Shimorina | Felix Conrads | Michael Röder | René Speck | Claire Gardent | Simon Mille | Nikolai Ilinykh | Axel-Cyrille Ngonga Ngomo

The RDF-to-text task has recently gained substantial attention due to the continuous growth of RDF knowledge graphs in number and size. Recent studies have focused on systematically comparing RDF-to-text approaches on benchmarking datasets such as WebNLG. Although some evaluation tools have already been proposed for text generation, none of the existing solutions abides by the Findability, Accessibility, Interoperability, and Reusability (FAIR) principles and involves RDF data for the knowledge extraction task. In this paper, we present BENG, a FAIR benchmarking platform for Natural Language Generation (NLG) and Knowledge Extraction systems with a focus on RDF data. BENG builds upon the successful benchmarking platform GERBIL, is open source and is publicly available along with the data it contains.

Controllable Neural Natural Language Generation: comparison of state-of-the-art control strategies
Yuanmin Leng | François Portet | Cyril Labbé | Raheel Qader

Most NLG systems target text fluency and grammatical correctness, disregarding control over text structure and length. However, control over the output plays an important part in industrial NLG applications. In this paper, we study different strategies of control in triple-to-text generation systems, particularly from the aspects of text structure and text length. Regarding text structure, we present an approach that relies on aligning the input entities with the facts on the target side. It makes sure that the order and the distribution of entities in both the input and the text are the same. As for control over text length, we show two different approaches. One is to supply a length constraint as input, while the other is to force the end-of-sentence tag to be included at each step when using a top-k decoding strategy. Finally, we propose four metrics to assess the degree to which these methods will affect an NLG system’s ability to control text structure and length. Our analyses demonstrate that all the methods enhance the system’s ability with a slight decrease in text fluency. In addition, constraining length at the input level performs much better than control at the decoding level.
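
The decoding-level length control mentioned in the abstract above can be illustrated with a minimal sketch. This is one possible reading, not the authors' implementation: the function name `top_k_with_eos`, the plain list of scores, and the choice to replace the weakest candidate with the end-of-sentence token are all assumptions made for illustration.

```python
import heapq


def top_k_with_eos(scores, k, eos_id):
    """Return k candidate token ids, always including eos_id (hypothetical helper)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Indices of the k highest-scoring tokens, best first.
    top = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    if eos_id not in top:
        # Assumption: make room for the end-of-sentence token by
        # dropping the weakest of the k candidates.
        top[-1] = eos_id
    return top
```

A sampler would then draw the next token from the returned candidates only, so the sentence can terminate at any step.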

Enhancing Sequence-to-Sequence Modelling for RDF triples to Natural Text
Oriol Domingo | David Bergés | Roser Cantenys | Roger Creus | José A. R. Fonollosa

This work establishes key guidelines on how, which and when Machine Translation (MT) techniques are worth applying to the RDF-to-Text task. Not only do we apply and compare the most prominent MT architecture, the Transformer, but we also analyze state-of-the-art techniques such as Byte Pair Encoding or Back Translation to demonstrate an improvement in generalization. In addition, we empirically show how to tailor these techniques to enhance models relying on learned embeddings rather than using pretrained ones. Automatic metrics suggest that Back Translation can significantly improve model performance, by up to 7 BLEU points, hence opening a window for surpassing state-of-the-art results with appropriate architectures.

Utilising Knowledge Graph Embeddings for Data-to-Text Generation
Nivranshu Pasricha | Mihael Arcan | Paul Buitelaar

Data-to-text generation has recently seen a move away from modular and pipeline architectures towards end-to-end architectures based on neural networks. In this work, we employ knowledge graph embeddings and explore their utility for end-to-end approaches in a data-to-text generation task. Our experiments show that using knowledge graph embeddings can yield an improvement of up to 2–3 BLEU points for seen categories on the WebNLG corpus without modifying the underlying neural network architecture.

The 2020 Bilingual, Bi-Directional WebNLG+ Shared Task: Overview and Evaluation Results (WebNLG+ 2020)
Thiago Castro Ferreira | Claire Gardent | Nikolai Ilinykh | Chris van der Lee | Simon Mille | Diego Moussallem | Anastasia Shimorina

WebNLG+ offers two challenges: (i) mapping sets of RDF triples to English or Russian text (generation) and (ii) converting English or Russian text to sets of RDF triples (semantic parsing). Compared to the eponymous WebNLG challenge, WebNLG+ provides an extended dataset that enables the training, evaluation, and comparison of microplanners and semantic parsers. In this paper, we present the results of the generation and semantic parsing tasks for both English and Russian and provide a brief description of the participating systems.

CycleGT: Unsupervised Graph-to-Text and Text-to-Graph Generation via Cycle Training
Qipeng Guo | Zhijing Jin | Xipeng Qiu | Weinan Zhang | David Wipf | Zheng Zhang

Two important tasks at the intersection of knowledge graphs and natural language processing are graph-to-text (G2T) and text-to-graph (T2G) conversion. Due to the difficulty and high cost of data collection, the supervised data available in the two fields are usually on the order of tens of thousands, for example, 18K in the WebNLG 2017 dataset after preprocessing, which is far fewer than the millions of examples for other tasks such as machine translation. Consequently, deep learning models for G2T and T2G suffer largely from scarce training data. We present CycleGT, an unsupervised training method that can bootstrap from fully non-parallel graph and text data, and iteratively back-translate between the two forms. Experiments on WebNLG datasets show that our unsupervised model trained on the same amount of data achieves performance on par with several fully supervised models. Further experiments on the non-parallel GenWiki dataset verify that our method performs the best among unsupervised baselines. This validates our framework as an effective approach to overcome the data scarcity problem in the fields of G2T and T2G.

Denoising Pre-Training and Data Augmentation Strategies for Enhanced RDF Verbalization with Transformers
Sebastien Montella | Betty Fabre | Tanguy Urvoy | Johannes Heinecke | Lina Rojas-Barahona

The task of verbalization of RDF triples has grown in popularity due to the rising ubiquity of Knowledge Bases (KBs). The formalism of RDF triples is a simple and efficient way to store facts at a large scale. However, its abstract representation makes it difficult for humans to interpret. For this purpose, the WebNLG challenge aims at promoting automated RDF-to-text generation. We propose to leverage pre-training on augmented data with the Transformer model using a data augmentation strategy. Our experimental results show minimum relative increases of 3.73%, 126.05% and 88.16% in BLEU score for seen categories, unseen entities and unseen categories respectively over the standard training.

𝒫2: A Plan-and-Pretrain Approach for Knowledge Graph-to-Text Generation
Qipeng Guo | Zhijing Jin | Ning Dai | Xipeng Qiu | Xiangyang Xue | David Wipf | Zheng Zhang

Text verbalization of knowledge graphs is an important problem with wide application to natural language generation (NLG) systems. It is challenging because the generated text not only needs to be grammatically correct (fluency), but also has to contain the given structured knowledge input (relevance) and meet some other criteria. We develop a plan-and-pretrain approach, 𝒫2, which consists of a relational graph convolutional network (R-GCN) planner and the pretrained sequence-to-sequence (Seq2Seq) model T5. Specifically, the R-GCN planner first generates an order of the knowledge graph triplets, corresponding to the order that they will be mentioned in text, and then T5 produces the surface realization of the given plan. In the WebNLG+ 2020 Challenge, our submission ranked in 1st place on all automatic and human evaluation criteria of the English RDF-to-text generation task.

Improving Text-to-Text Pre-trained Models for the Graph-to-Text Task
Zixiaofan Yang | Arash Einolghozati | Hakan Inan | Keith Diedrick | Angela Fan | Pinar Donmez | Sonal Gupta

Converting a knowledge graph or sub-graph to natural text is useful when answering questions based on a knowledge base. High-capacity language models pre-trained on large-scale text corpora have recently been shown to be powerful when fine-tuned for the knowledge-graph-to-text (KG-to-text) task. In this paper, we propose two classes of methods to improve such pre-trained models for this task. First, we improve the structure awareness of the model by organizing the input as well as learning optimal ordering via multitask learning. Second, we bridge the domain gap between text-to-text and KG-to-text tasks via a second-phase KG-to-text pre-training on similar datasets and extra lexicalization supervision to make the input more similar to natural text. We demonstrate the efficacy of our methods on the popular WebNLG dataset. Our best model achieves an almost 3 point BLEU improvement on a strong baseline while lowering the relative slot-error-rate by around 35%. We also validate our results via human evaluation.

Leveraging Large Pretrained Models for WebNLG 2020
Xintong Li | Aleksandre Maskharashvili | Symon Jory Stevens-Guille | Michael White

In this paper, we report experiments on finetuning large pretrained models to realize resource description framework (RDF) triples to natural language. We provide the details of how to build one of the top-ranked English generation models in WebNLG Challenge 2020. We also show that there appears to be considerable potential for reranking to improve the current state of the art both in terms of statistical metrics and model-based metrics. Our human analyses of the generated texts show that for Russian, pretrained models showed some success, both in terms of lexical and morpho-syntactic choices for generation, as well as for content aggregation. Nevertheless, in a number of cases, the model can be unpredictable, both in terms of failure or success. Omissions of the content and hallucinations, which in many cases occurred at the same time, were major problems. By contrast, the models for English showed near perfect performance on the validation set.

Machine Translation Aided Bilingual Data-to-Text Generation and Semantic Parsing
Oshin Agarwal | Mihir Kale | Heming Ge | Siamak Shakeri | Rami Al-Rfou

We present a system for bilingual Data-to-Text Generation and Semantic Parsing. We use a text-to-text generator to learn a single model that works for both languages on each of the tasks. The model is aided by machine translation during both pre-training and fine-tuning. We evaluate the system on WebNLG 2020 data, which consists of RDF triples in English and natural language sentences in English and Russian for both the tasks. We achieve considerable gains over monolingual models, especially on unseen relations and Russian.

NILC at WebNLG+: Pretrained Sequence-to-Sequence Models on RDF-to-Text Generation
Marco Antonio Sobrevilla Cabezudo | Thiago A. S. Pardo

This paper describes the submission by the NILC Computational Linguistics research group of the University of São Paulo/Brazil to the RDF-to-Text task for English at the WebNLG+ challenge. The success of current pretrained models like BERT or GPT-2 in text-to-text generation tasks is well known; however, their application to and success on data-to-text generation have not been well studied or proven. We therefore explore how well a pretrained model, in particular BART, performs on the data-to-text generation task. The results obtained were worse than the baseline and other systems in almost all automatic measures. However, the human evaluation shows better results for our system. Besides, results suggest that BART may generate paraphrases of reference texts.

NUIG-DSI at the WebNLG+ challenge: Leveraging Transfer Learning for RDF-to-text generation
Nivranshu Pasricha | Mihael Arcan | Paul Buitelaar

This paper describes the system submitted by NUIG-DSI to the WebNLG+ challenge 2020 in the RDF-to-text generation task for the English language. For this challenge, we leverage transfer learning by adopting the T5 model architecture for our submission and fine-tune the model on the WebNLG+ corpus. Our submission ranks among the top five systems for most of the automatic evaluation metrics achieving a BLEU score of 51.74 over all categories with scores of 58.23 and 45.57 across seen and unseen categories respectively.

RDFjsRealB: a Symbolic Approach for Generating Text from RDF Triples
Guy Lapalme

This paper describes the Resource Description Framework (RDF) triples verbalizer developed for the WebNLG Challenge 2020 shared task. After reviewing representative works in Natural Language Generation in the context of the Semantic Web, the task is then described. We then sketch the symbolic approach we used for verbalizing RDF triples: once the triples are grouped by subject, each group is realized as one or more sentences using templates written in Python whose output is fed to an English realizer written in JavaScript. The system was developed using the test data of the previous edition of the task and the train and development data of this year’s task. The automatic scores for this year’s test data are quite competitive. We conclude with a critical review of the data and discuss the suitability of this competition’s results in a wider Natural Language Generation setting.
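
The grouping-then-templating idea in the abstract above can be illustrated with a minimal sketch. It is not the RDFjsRealB code: the `TEMPLATES` table, the `group_by_subject` and `realize` helpers, and the sample properties are hypothetical, and plain string formatting stands in for the separate English realizer.

```python
from collections import defaultdict

# Hypothetical templates keyed by RDF property; not taken from the paper.
TEMPLATES = {
    "birthPlace": "was born in {obj}",
    "occupation": "was a {obj}",
}


def group_by_subject(triples):
    """Group (subject, property, object) triples by subject, keeping input order."""
    groups = defaultdict(list)
    for subj, prop, obj in triples:
        groups[subj].append((prop, obj))
    return dict(groups)


def realize(triples):
    """Produce one sentence per subject by joining the templated predicates."""
    sentences = []
    for subj, facts in group_by_subject(triples).items():
        parts = []
        for prop, obj in facts:
            # Fall back to a generic pattern for properties without a template.
            template = TEMPLATES.get(prop, "has " + prop + " {obj}")
            parts.append(template.format(obj=obj.replace("_", " ")))
        sentences.append(subj.replace("_", " ") + " " + " and ".join(parts) + ".")
    return sentences
```

Grouping first lets several facts about the same subject share one sentence instead of producing one sentence per triple.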

Semantic Triples Verbalization with Generative Pre-Training Model
Pavel Blinov

The paper is devoted to the problem of automatic text generation from RDF triples. This problem was formalized and proposed as a part of the 2020 WebNLG challenge. We describe our approach to the RDF-to-text generation task based on a neural network model with the Generative Pre-Training (GPT-2) architecture. In particular, we outline a way to convert the base GPT-2 model into a model with language and classification heads and discuss the text generation methods. To study the parameters’ influence on the end-task performance, a series of experiments was carried out. We report the resulting metrics and conclude with possible improvement directions.

Text-to-Text Pre-Training Model with Plan Selection for RDF-to-Text Generation
Natthawut Kertkeidkachorn | Hiroya Takamura

We report our system description for the RDF-to-Text task in English on the WebNLG 2020 Challenge. Our approach consists of two parts: 1) RDF-to-Text Generation Pipeline and 2) Plan Selection. The RDF-to-Text Generation Pipeline is built on the state-of-the-art pretraining model, while Plan Selection helps decide the proper plan to feed into the pipeline.

The UPC RDF-to-Text System at WebNLG Challenge 2020
David Bergés | Roser Cantenys | Roger Creus | Oriol Domingo | José A. R. Fonollosa

This work describes the end-to-end system architecture presented at WebNLG Challenge 2020. The system follows the traditional Machine Translation (MT) pipeline, based on the Transformer model, applied in most text-to-text problems. Our solution is enriched by means of a Back Translation step over the original corpus. Thus, the system directly relies on the lexicalised format, since the synthetic data limits the use of delexicalisation.

Train Hard, Finetune Easy: Multilingual Denoising for RDF-to-Text Generation
Zdeněk Kasner | Ondřej Dušek

We describe our system for the RDF-to-text generation task of the WebNLG Challenge 2020. We base our approach on the mBART model, which is pre-trained for multilingual denoising. This allows us to use a simple, identical, end-to-end setup for both English and Russian. Requiring minimal task- or language-specific effort, our model placed in the first third of the leaderboard for English and first or second for Russian on automatic metrics, and it made it into the best or second-best system cluster on human evaluation.

WebNLG 2020 Challenge: Semantic Template Mining for Generating References from RDF
Trung Tran | Dang Tuan Nguyen

We present in this paper our mining system for the WebNLG Challenge 2020 shared task. The general idea of the system is that we generate the semantic template of the output reference from the input RDF XML structure. In the training process, we perform the following subtasks: (i) extract the core information from input RDF; (ii) generate semantic templates from corresponding references. With new RDF XML data, we detect the core information, then add the new template to the warehouse and determine the output semantic template. We will evaluate the output natural language references in two processes: automatic and human evaluation. The results of the first process show that our system generates high-quality English descriptions from the test RDF XML structures and makes a good contribution to the NLG state of the art.

WebNLG Challenge 2020: Language Agnostic Delexicalisation for Multilingual RDF-to-text generation
Giulio Zhou | Gerasimos Lampouras

This paper presents our submission to the WebNLG Challenge 2020 for the English and Russian RDF-to-text generation tasks. Our first of three submissions is based on Language Agnostic Delexicalisation, a novel delexicalisation method that matches values in the input to their occurrences in the corresponding text through comparison of pretrained multilingual embeddings, and employs a character-level post-editing model to inflect words in their correct form during relexicalisation. Our second submission forfeits delexicalisation and uses SentencePiece subwords as basic units. Our third submission combines the previous two by alternating between the output of the delexicalisation-based system when the input contains unseen entities and/or properties and the output of the SentencePiece-based system when the input is seen during training.
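
The selection rule behind the third submission can be illustrated with a minimal sketch. The function name `choose_output`, its parameters, and the representation of seen entities and properties as Python sets are assumptions for illustration; the abstract does not specify how the authors implement the check.

```python
def choose_output(input_triples, seen_entities, seen_properties,
                  delex_output, subword_output):
    """Pick between two systems' outputs for one input (hypothetical helper).

    Returns the delexicalisation-based output if any subject, object, or
    property in the input was not observed during training; otherwise
    returns the SentencePiece-based output.
    """
    for subj, prop, obj in input_triples:
        unseen_entity = subj not in seen_entities or obj not in seen_entities
        unseen_property = prop not in seen_properties
        if unseen_entity or unseen_property:
            return delex_output
    return subword_output
```

The rule uses each system only where it is expected to be strongest: delexicalisation for unseen items, subwords for inputs already covered by the training data.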