International Natural Language Generation Conference (2022)


pdf (full)
bib (full)
Proceedings of the 15th International Conference on Natural Language Generation

pdf bib
Proceedings of the 15th International Conference on Natural Language Generation
Samira Shaikh | Thiago Ferreira | Amanda Stent

pdf bib
Evaluating Referring Form Selection Models in Partially-Known Environments
Zhao Han | Polina Rygina | Thomas Williams

pdf bib
Template-based Approach to Zero-shot Intent Recognition
Dmitry Lamanov | Pavel Burnyshev | Katya Artemova | Valentin Malykh | Andrey Bout | Irina Piontkovskaya

pdf
“Slow Service” ↛ “Great Food”: Enhancing Content Preservation in Unsupervised Text Style Transfer
Wanzheng Zhu | Suma Bhat

pdf
Arabic Image Captioning using Pre-training of Deep Bidirectional Transformers
Jonathan Emami | Pierre Nugues | Ashraf Elnagar | Imad Afyouni

pdf
Plot Writing From Pre-Trained Language Models
Yiping Jin | Vishakha Kadam | Dittaya Wanvarie

pdf
Paraphrasing via Ranking Many Candidates
Joosung Lee

pdf
Evaluating Legal Accuracy of Neural Generators on the Generation of Criminal Court Dockets Description
Nicolas Garneau | Eve Gaumond | Luc Lamontagne | Pierre-Luc Déziel

pdf
Automatic Generation of Factual News Headlines in Finnish
Maximilian Koppatz | Khalid Alnajjar | Mika Hämäläinen | Thierry Poibeau

pdf
Generating Coherent and Informative Descriptions for Groups of Visual Objects and Categories: A Simple Decoding Approach
Nazia Attari | David Schlangen | Martin Heckmann | Heiko Wersing | Sina Zarrieß

pdf
Dealing with hallucination and omission in neural Natural Language Generation: A use case on meteorology.
Javier González Corbelle | Alberto Bugarín-Diz | Jose Alonso-Moral | Juan Taboada

pdf
Amortized Noisy Channel Neural Machine Translation
Richard Yuanzhe Pang | He He | Kyunghyun Cho

pdf
Math Word Problem Generation with Multilingual Language Models
Kashyapa Niyarepola | Dineth Athapaththu | Savindu Ekanayake | Surangika Ranathunga

pdf
Comparing informativeness of an NLG chatbot vs graphical app in diet-information domain
Simone Balloccu | Ehud Reiter

pdf
Generation of Student Questions for Inquiry-based Learning
Kevin Ros | Maxwell Jong | Chak Ho Chan | ChengXiang Zhai

pdf
Keyword Provision Question Generation for Facilitating Educational Reading Comprehension Preparation
Ying-Hong Chan | Ho-Lam Chung | Yao-Chung Fan

pdf
Generating Landmark-based Manipulation Instructions from Image Pairs
Sina Zarrieß | Henrik Voigt | David Schlangen | Philipp Sadler

pdf
Zero-shot Cross-Linguistic Learning of Event Semantics
Malihe Alikhani | Thomas Kober | Bashar Alhafni | Yue Chen | Mert Inan | Elizabeth Nielsen | Shahab Raji | Mark Steedman | Matthew Stone

pdf
Nominal Metaphor Generation with Multitask Learning
Yucheng Li | Chenghua Lin | Frank Guerin

pdf
Look and Answer the Question: On the Role of Vision in Embodied Question Answering
Nikolai Ilinykh | Yasmeen Emampoor | Simon Dobnik

pdf
Strategies for framing argumentative conclusion generation
Philipp Heinisch | Anette Frank | Juri Opitz | Philipp Cimiano

pdf
LAFT: Cross-lingual Transfer for Text Generation by Language-Agnostic Finetuning
Xianze Wu | Zaixiang Zheng | Hao Zhou | Yong Yu

pdf
Quantum Natural Language Generation on Near-Term Devices
Amin Karamlou | James Wootton | Marcel Pfaffhauser

pdf
Towards Evaluation of Multi-party Dialogue Systems
Khyati Mahajan | Sashank Santhanam | Samira Shaikh

pdf
Are Current Decoding Strategies Capable of Facing the Challenges of Visual Dialogue?
Amit Kumar Chaudhary | Alex J. Lucassen | Ioanna Tsani | Alberto Testoni

pdf
Analogy Generation by Prompting Large Language Models: A Case Study of InstructGPT
Bhavya Bhavya | Jinjun Xiong | ChengXiang Zhai


pdf (full)
bib (full)
Proceedings of the 15th International Conference on Natural Language Generation: System Demonstrations

pdf bib
Proceedings of the 15th International Conference on Natural Language Generation: System Demonstrations
Samira Shaikh | Thiago Ferreira | Amanda Stent

pdf bib
BLAB Reporter: Automated journalism covering the Blue Amazon
Yan Sym | João Campos | Fabio Cozman

This demo paper introduces BLAB Reporter, a robot-journalist system covering the Brazilian Blue Amazon. The application is based on a pipeline architecture for Natural Language Generation, which offers daily reports, news summaries and curious facts in Brazilian Portuguese. By collecting, storing and analysing structured data from publicly available sources, the robot-journalist uses domain knowledge to generate, validate and publish texts on Twitter. Code and corpus are publicly available.

pdf bib
Generating Quizzes to Support Training on Quality Management and Assurance in Space Science and Engineering
Andres Garcia-Silva | Cristian Berrio Aroca | Jose Manuel Gomez-Perez | Jose Martinez | Patrick Fleith | Stefano Scaglioni

Quality management and assurance is key for space agencies to guarantee the success of space missions, which are high-risk and extremely costly. In this paper, we present a system to generate quizzes, a common resource to evaluate the effectiveness of training sessions, from documents about quality assurance procedures in the Space domain. Our system leverages state-of-the-art auto-regressive models like T5 and BART to generate questions, and a RoBERTa model to extract answers for such questions, thus verifying their suitability.
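
A rough sketch of the generate-then-verify pipeline this abstract describes, using Hugging Face pipelines; the checkpoints, the "generate question:" prompt and the suitability threshold are our assumptions, not the authors' models.

```python
# Minimal sketch: generate a question, then verify it is answerable with a QA model.
from transformers import pipeline

# Hypothetical seq2seq checkpoint standing in for the paper's fine-tuned T5/BART.
question_gen = pipeline("text2text-generation", model="t5-base")
# A public RoBERTa QA model; the paper's exact checkpoint is not stated.
answer_extractor = pipeline("question-answering", model="deepset/roberta-base-squad2")

def make_quiz_item(passage: str):
    # Generate a candidate question from the source passage.
    question = question_gen(f"generate question: {passage}",
                            max_new_tokens=48)[0]["generated_text"]
    # Extract an answer; a low confidence score suggests the question
    # is not actually answerable from the passage.
    result = answer_extractor(question=question, context=passage)
    if result["score"] < 0.3:  # assumed suitability threshold
        return None
    return {"question": question, "answer": result["answer"]}
```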

pdf
Automated Ad Creative Generation
Vishakha Kadam | Yiping Jin | Bao-Dai Nguyen-Hoang

Ad creatives are ads served to users on a webpage, app, or other digital environments. The demand for compelling ad creatives surges drastically with the ever-increasing popularity of digital marketing. The two most essential elements of (display) ad creatives are the advertising message, such as headlines and description texts, and the visual component, such as images and videos. Traditionally, ad creatives are composed by professional copywriters and creative designers. The process requires significant human effort, limiting the scalability and efficiency of digital ad campaigns. This work introduces AUTOCREATIVE, a novel system to automatically generate ad creatives relying on natural language generation and computer vision techniques. The system generates multiple ad copies (ad headlines/description texts) using a sequence-to-sequence model and selects images most suitable to the generated ad copies based on heuristic-based visual appeal metrics and a text-image retrieval pipeline.
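
The text-image retrieval step could be implemented with a dual encoder such as CLIP, as sketched below; the paper does not name its retrieval model, so this is an assumption.

```python
# Rank candidate images by their similarity to a generated ad copy.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rank_images(ad_copy: str, image_paths: list[str]) -> list[tuple[str, float]]:
    images = [Image.open(p) for p in image_paths]
    inputs = processor(text=[ad_copy], images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_image: similarity of each candidate image to the ad copy.
    scores = out.logits_per_image.squeeze(1).tolist()
    return sorted(zip(image_paths, scores), key=lambda x: -x[1])
```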

pdf
THEaiTRobot: An Interactive Tool for Generating Theatre Play Scripts
Rudolf Rosa | Patrícia Schmidtová | Alisa Zakhtarenko | Ondrej Dusek | Tomáš Musil | David Mareček | Saad Ul Islam | Marie Novakova | Klara Vosecka | Daniel Hrbek | David Kostak

We present a free online demo of THEaiTRobot, an open-source bilingual tool for interactively generating theatre play scripts, in two versions. THEaiTRobot 1.0 uses the GPT-2 language model with minimal adjustments. THEaiTRobot 2.0 uses two models created by fine-tuning GPT-2 on purposefully collected and processed datasets and several other components, generating play scripts in a hierarchical fashion (title → synopsis → script). The underlying tool is used in the THEaiTRE project to generate scripts for plays, which are then performed on stage by a professional theatre.
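
The hierarchical title → synopsis → script scheme can be sketched with a vanilla GPT-2, as below; THEaiTRobot 2.0 uses its own fine-tuned models and prompt formats, so everything here is illustrative.

```python
# Two-stage hierarchical generation: title -> synopsis -> script.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

def generate(prompt: str, max_new_tokens: int) -> str:
    out = generator(prompt, max_new_tokens=max_new_tokens,
                    do_sample=True, top_p=0.9)
    # Strip the prompt so only the continuation is returned.
    return out[0]["generated_text"][len(prompt):]

title = "The Lonely Robot"                                   # hypothetical input
synopsis = generate(f"Title: {title}\nSynopsis:", 80)        # stage 1: expand title
script = generate(f"Synopsis: {synopsis}\nScript:\n", 300)   # stage 2: expand synopsis
```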

pdf (full)
bib (full)
Proceedings of the 15th International Conference on Natural Language Generation: Generation Challenges

pdf bib
Proceedings of the 15th International Conference on Natural Language Generation: Generation Challenges
Samira Shaikh | Thiago Ferreira | Amanda Stent

pdf bib
The Second Automatic Minuting (AutoMin) Challenge: Generating and Evaluating Minutes from Multi-Party Meetings
Tirthankar Ghosal | Marie Hledíková | Muskaan Singh | Anna Nedoluzhko | Ondřej Bojar

We would host the AutoMin generation challenge at INLG 2023 as a follow-up of the first AutoMin shared task at Interspeech 2021. Our shared task primarily concerns the automated generation of meeting minutes from multi-party meeting transcripts. In our first venture, we observed the difficulty of the task and highlighted a number of open problems for the community to discuss, attempt, and solve. Hence, we invite the Natural Language Generation (NLG) community to take part in the second iteration of AutoMin. Like the first, the second AutoMin will feature both English and Czech meetings and the core task of summarizing the manually-revised transcripts into bulleted minutes. A new challenge we are introducing this year is to devise efficient metrics for evaluating the quality of minutes. We will also host an optional track to generate minutes for European parliamentary sessions. We carefully curated the datasets for the above tasks. Our ELITR Minuting Corpus has been recently accepted to LREC 2022 and publicly released. We are already preparing a new test set for evaluating the new shared tasks. We hope to carry forward the learning from the first AutoMin and instigate more community attention and interest in this timely yet challenging problem. INLG, the premier forum for the NLG community, would be an appropriate venue to discuss the challenges and future of Automatic Minuting. The main objective of the AutoMin GenChal at INLG 2023 would be to come up with efficient methods to automatically generate meeting minutes and design evaluation metrics to measure the quality of the minutes.

pdf bib
The Cross-lingual Conversation Summarization Challenge
Yulong Chen | Ming Zhong | Xuefeng Bai | Naihao Deng | Jing Li | Xianchao Zhu | Yue Zhang

We propose the shared task of cross-lingual conversation summarization, ConvSumX Challenge, opening new avenues for researchers to investigate solutions that integrate conversation summarization and machine translation. This task can be particularly useful due to the emergence of online meetings and conferences. We use a new benchmark, covering 2 real-world scenarios and 3 language directions, including a low-resource language, for evaluation. We hope that ConvSumX can motivate research to go beyond English and break the barrier for non-English speakers to benefit from recent advances of conversation summarization.

pdf
HinglishEval Generation Challenge on Quality Estimation of Synthetic Code-Mixed Text: Overview and Results
Vivek Srivastava | Mayank Singh

We hosted a shared task to investigate the factors influencing the quality of code-mixed text generation systems. The teams experimented with two systems that generate synthetic code-mixed Hinglish sentences. They also experimented with human ratings that evaluate the generation quality of the two systems. The first-of-its-kind proposed subtasks, (i) quality rating prediction and (ii) annotators' disagreement prediction on the synthetic Hinglish dataset, made the shared task quite popular among the multilingual research community. A total of 46 participants comprising 23 teams from 18 institutions registered for this shared task. The detailed description of the task and the leaderboard is available at https://codalab.lisn.upsaclay.fr/competitions/1688.

pdf
PreCogIIITH at HinglishEval : Leveraging Code-Mixing Metrics & Language Model Embeddings To Estimate Code-Mix Quality
Prashant Kodali | Tanmay Sachan | Akshay Goindani | Anmol Goel | Naman Ahuja | Manish Shrivastava | Ponnurangam Kumaraguru

Code-Mixing is a phenomenon of mixing two or more languages in a speech event and is prevalent in multilingual societies. Given the low-resource nature of Code-Mixing, machine generation of code-mixed text is a prevalent approach for data augmentation. However, evaluating the quality of such machine-generated code-mixed text is an open problem. In our submission to HinglishEval, a shared task collocated with INLG2022, we attempt to model the factors that impact the quality of synthetically generated code-mixed text by predicting ratings for code-mix quality. The HinglishEval shared task consists of two subtasks: a) quality rating prediction; b) disagreement prediction. We leverage popular code-mixed metrics and embeddings of multilingual large language models (MLLMs) as features, and train task-specific MLP regression models. Our approach could not beat the baseline results. However, for Subtask-A our team ranked a close second on F-1 and Cohen's Kappa Score measures and first for the Mean Squared Error measure. For Subtask-B our approach ranked third for F1 score, and first for the Mean Squared Error measure. Code of our submission can be accessed here.

pdf
niksss at HinglishEval: Language-agnostic BERT-based Contextual Embeddings with Catboost for Quality Evaluation of the Low-Resource Synthetically Generated Code-Mixed Hinglish Text
Nikhil Singh

This paper presents our system description for the HinglishEval challenge at INLG 2022. The goal of this task was to investigate the factors influencing the quality of the code-mixed text generation system. The task was divided into two subtasks, quality rating prediction and annotators' disagreement prediction on the synthetic Hinglish dataset. We attempted to solve these tasks using sentence-level embeddings, which are obtained from mean pooling the contextualized word embeddings for all input tokens in our text. We experimented with various classifiers on top of the embeddings produced for the respective tasks. Our best-performing system ranked 1st on subtask B and 3rd on subtask A. We make our code available here: https://github.com/nikhilbyte/Hinglish-qEval
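
A minimal sketch of the mean-pooled sentence embeddings described above, paired with a CatBoost model as in the title; the LaBSE checkpoint is our assumption for a "language-agnostic BERT" encoder.

```python
# Mean-pool contextual token embeddings into sentence embeddings, then classify.
import torch
from transformers import AutoModel, AutoTokenizer
from catboost import CatBoostClassifier

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/LaBSE")
encoder = AutoModel.from_pretrained("sentence-transformers/LaBSE")

def embed(sentences: list[str]) -> torch.Tensor:
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state   # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)      # (B, T, 1)
    # Mean pooling over real (non-padding) tokens only.
    return (hidden * mask).sum(1) / mask.sum(1)

# X: embeddings of synthetic Hinglish sentences; y: quality ratings (hypothetical).
clf = CatBoostClassifier(iterations=500, verbose=False)
# clf.fit(embed(train_sentences).numpy(), train_labels)
```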

pdf
BITS Pilani at HinglishEval: Quality Evaluation for Code-Mixed Hinglish Text Using Transformers
Shaz Furniturewala | Vijay Kumari | Amulya Ratna Dash | Hriday Kedia | Yashvardhan Sharma

Code-Mixed text data consists of sentences having words or phrases from more than one language. Most multilingual communities worldwide communicate using multiple languages, with English usually one of them. Hinglish is a Code-Mixed text composed of Hindi and English but written in Roman script. This paper aims to determine the factors influencing the quality of Code-Mixed text data generated by the system. For the HinglishEval task, the proposed model uses multilingual BERT to find the similarity between synthetically generated and human-generated sentences to predict the quality of synthetically generated Hinglish sentences.
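
One plausible reading of the similarity feature described above, sketched with vanilla mBERT; the paper does not state which pooling it uses, so the [CLS] vector here is an assumption.

```python
# Cosine similarity between mBERT embeddings of a synthetic sentence
# and its human-generated counterpart.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
mbert = AutoModel.from_pretrained("bert-base-multilingual-cased")

def cls_embedding(sentence: str) -> torch.Tensor:
    batch = tok(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return mbert(**batch).last_hidden_state[:, 0]   # [CLS] vector, (1, H)

sim = torch.cosine_similarity(cls_embedding("synthetic Hinglish sentence"),
                              cls_embedding("human-generated counterpart"))
```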

pdf
JU_NLP at HinglishEval: Quality Evaluation of the Low-Resource Code-Mixed Hinglish Text
Prantik Guha | Rudra Dhar | Dipankar Das

In this paper we describe a system submitted to the INLG 2022 Generation Challenge (GenChal) on Quality Evaluation of the Low-Resource Synthetically Generated Code-Mixed Hinglish Text. We implement a Bi-LSTM-based neural network model to predict the Average rating score and Disagreement score of the synthetic Hinglish dataset. In our models, we used word embeddings for English and Hindi data, and one-hot encodings for Hinglish data. We achieved an F1 score of 0.11 and a mean squared error of 6.0 in the average rating score prediction task. In the task of Disagreement score prediction, we achieve an F1 score of 0.18 and a mean squared error of 5.0.
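
A minimal PyTorch sketch of a Bi-LSTM predictor of the kind described; the embedding size and the shared two-value output head are illustrative assumptions.

```python
# Bi-LSTM over token embeddings with a regression head for the two scores.
import torch
import torch.nn as nn

class BiLSTMRegressor(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 100, hidden: int = 128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 2)   # [avg_rating, disagreement]

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        emb = self.embedding(token_ids)             # (B, T, E)
        _, (h_n, _) = self.lstm(emb)                # h_n: (2, B, H)
        final = torch.cat([h_n[0], h_n[1]], dim=1)  # concat both directions
        return self.head(final)
```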

pdf
The 2022 ReproGen Shared Task on Reproducibility of Evaluations in NLG: Overview and Results
Anya Belz | Anastasia Shimorina | Maja Popović | Ehud Reiter

Against a background of growing interest in reproducibility in NLP and ML, and as part of an ongoing research programme designed to develop theory and practice of reproducibility assessment in NLP, we organised the second shared task on reproducibility of evaluations in NLG, ReproGen 2022. This paper describes the shared task, summarises results from the reproduction studies submitted, and provides further comparative analysis of the results. Out of six initial team registrations, we received submissions from five teams. Meta-analysis of the five reproduction studies revealed varying degrees of reproducibility, and allowed further tentative conclusions about what types of evaluation tend to have better reproducibility.

pdf
Two Reproductions of a Human-Assessed Comparative Evaluation of a Semantic Error Detection System
Rudali Huidrom | Ondřej Dušek | Zdeněk Kasner | Thiago Castro Ferreira | Anya Belz

In this paper, we present the results of two reproduction studies for the human evaluation originally reported by Dušek and Kasner (2020) in which the authors comparatively evaluated outputs produced by a semantic error detection system for data-to-text generation against reference outputs. In the first reproduction, the original evaluators repeat the evaluation, in a test of the repeatability of the original evaluation. In the second study, two new evaluators carry out the evaluation task, in a test of the reproducibility of the original evaluation under otherwise identical conditions. We describe our approach to reproduction, and present and analyse results, finding different degrees of reproducibility depending on result type, data and labelling task. Our resources are available and open-sourced.

pdf
Reproducibility of Exploring Neural Text Simplification Models: A Review
Mohammad Arvan | Luís Pina | Natalie Parde

The reproducibility of NLP research has drawn increased attention over the last few years. Several tools, guidelines, and metrics have been introduced to address concerns in regard to this problem; however, much work still remains to ensure widespread adoption of effective reproducibility standards. In this work, we review the reproducibility of Exploring Neural Text Simplification Models by Nisioi et al. (2017), evaluating it from three main aspects: data, software artifacts, and automatic evaluations. We discuss the challenges and issues we faced during this process. Furthermore, we explore the adequacy of current reproducibility standards. Our code, trained models, and a docker container of the environment used for training and evaluation are made publicly available.

pdf
The Accuracy Evaluation Shared Task as a Retrospective Reproduction Study
Craig Thomson | Ehud Reiter

We investigate the data collected for the Accuracy Evaluation Shared Task as a retrospective reproduction study. The shared task was based upon errors found by human annotation of computer generated summaries of basketball games. Annotation was performed in three separate stages, with texts taken from the same three systems and checked for errors by the same three annotators. We show that the mean count of errors was consistent at the highest level for each experiment, with increased variance when looking at per-system and/or per-error-type breakdowns.

pdf
Reproducing a Manual Evaluation of the Simplicity of Text Simplification System Outputs
Maja Popović | Sheila Castilho | Rudali Huidrom | Anya Belz

In this paper we describe our reproduction study of the human evaluation of text simplicity reported by Nisioi et al. (2017). The work was carried out as part of the ReproGen Shared Task 2022 on Reproducibility of Evaluations in NLG. Our aim was to repeat the evaluation of simplicity for nine automatic text simplification systems with a different set of evaluators. We describe our experimental design together with the known aspects of the original experimental design and present the results from both studies. Pearson correlation between the original and reproduction scores is moderate to high (0.776). Inter-annotator agreement in the reproduction study is lower (0.40) than in the original study (0.66). We discuss challenges arising from the unavailability of certain aspects of the original set-up, and make several suggestions as to how reproduction of similar evaluations can be made easier in future.
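
The two headline numbers in this abstract (correlation between studies, inter-annotator agreement) can be computed along these lines; the arrays below are placeholders, and the original studies' exact agreement statistic may differ from Cohen's kappa.

```python
# Correlation between original and reproduction scores, plus annotator agreement.
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

original_scores = [3.2, 4.1, 2.8, 4.5]        # hypothetical per-system means
reproduction_scores = [3.0, 4.3, 2.9, 4.4]
r, p = pearsonr(original_scores, reproduction_scores)

annotator_a = [1, 2, 2, 3, 1]                 # hypothetical simplicity ratings
annotator_b = [1, 2, 3, 3, 2]
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Pearson r = {r:.3f}, inter-annotator kappa = {kappa:.2f}")
```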

pdf
A reproduction study of methods for evaluating dialogue system output: Replicating Santhanam and Shaikh (2019)
Anouck Braggaar | Frédéric Tomas | Peter Blomsma | Saar Hommes | Nadine Braun | Emiel van Miltenburg | Chris van der Lee | Martijn Goudbeek | Emiel Krahmer

In this paper, we describe our reproduction effort of the paper: Towards Best Experiment Design for Evaluating Dialogue System Output by Santhanam and Shaikh (2019) for the 2022 ReproGen shared task. We aim to produce the same results, using different human evaluators, and a different implementation of the automatic metrics used in the original paper. Although overall the study posed some challenges to reproduce (e.g. difficulties with reproduction of automatic metrics and statistics), in the end we did find that the results generally replicate the findings of Santhanam and Shaikh (2019) and seem to follow similar trends.

pdf
DialogSum Challenge: Results of the Dialogue Summarization Shared Task
Yulong Chen | Naihao Deng | Yang Liu | Yue Zhang

We report the results of the DialogSum Challenge, the shared task on summarizing real-life scenario dialogues at INLG 2022. Four teams participate in this shared task and three submit their system reports, exploring different methods to improve the performance of dialogue summarization. Although there is a great improvement over the baseline models regarding automatic evaluation metrics, such as ROUGE scores, we find that there is a salient gap between model-generated outputs and human-annotated summaries by human evaluation from multiple aspects. These findings demonstrate the difficulty of dialogue summarization and suggest that more fine-grained evaluation metrics are needed.
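
As a concrete reference for the automatic metrics mentioned above, here is a minimal sketch using the rouge-score and bert-score Python packages; the example texts are invented and the library choice is ours, not the shared task's official scorer.

```python
# Score a candidate summary against a reference with ROUGE and BERTScore.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "Person1 asks Person2 about the train schedule."
candidate = "Person1 asks about train times."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)    # precision/recall/F1 per metric

P, R, F1 = bert_score([candidate], [reference], lang="en")
print(rouge["rougeL"].fmeasure, F1.item())
```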

pdf
TCS_WITM_2022 @ DialogSum : Topic oriented Summarization using Transformer based Encoder Decoder Model
Vipul Chauhan | Prasenjeet Roy | Lipika Dey | Tushar Goel

In this paper, we present our approach to the DialogSum challenge, which was proposed as a shared task aimed to summarize dialogues from real-life scenarios. The challenge was to design a system that can generate fluent and salient summaries of a multi-turn dialogue text. Dialogue summarization has many commercial applications as it can be used to summarize conversations between customers and service agents, meeting notes, conference proceedings etc. Appropriate dialogue summarization can enhance the experience of conversing with chatbots or personal digital assistants. We have proposed a topic-based abstractive summarization method, which is generated by fine-tuning PEGASUS, the state-of-the-art abstractive summary generation model. We have compared different types of fine-tuning approaches that can lead to different types of summaries. We found that since conversations usually veer around a topic, using topics along with the dialogues helps to generate more human-like summaries. The topics in this case resemble the user perspective, around which summaries are usually sought. The generated summary has been evaluated against ground truth summaries provided by the challenge owners. We use the py-rouge score and BERTScore metrics to compare the results.
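
One simple way to condition a PEGASUS summarizer on a topic is to prepend it to the input, roughly as below; the checkpoint, separator tokens and prompt format are our assumptions rather than the authors' exact setup.

```python
# Topic-conditioned input construction for a PEGASUS summarizer.
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

name = "google/pegasus-xsum"
tokenizer = PegasusTokenizer.from_pretrained(name)
model = PegasusForConditionalGeneration.from_pretrained(name)

def summarize_with_topic(topic: str, dialogue: str) -> str:
    text = f"topic: {topic} dialogue: {dialogue}"   # assumed input format
    batch = tokenizer(text, truncation=True, return_tensors="pt")
    ids = model.generate(**batch, max_new_tokens=64)
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```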

pdf
A Multi-Task Learning Approach for Summarization of Dialogues
Saprativa Bhattacharjee | Kartik Shinde | Tirthankar Ghosal | Asif Ekbal

We describe our multi-task learning based approach for summarization of real-life dialogues as part of the DialogSum Challenge shared task at INLG 2022. Our approach intends to improve the main task of abstractive summarization of dialogues through the auxiliary tasks of extractive summarization, novelty detection and language modeling. We conduct extensive experimentation with different combinations of tasks and compare the results. In addition, we also incorporate the topic information provided with the dataset to perform topic-aware summarization. We report the results of automatic evaluation of the generated summaries in terms of ROUGE and BERTScore.

pdf
Dialogue Summarization using BART
Conrad Lundberg | Leyre Sánchez Viñuela | Siena Biales

This paper introduces the model and settings submitted to the INLG 2022 DialogSum Challenge, a shared task to generate summaries of real-life scenario dialogues between two people. In this paper, we explored using intermediate task transfer learning, reported speech, and the use of a supplementary dataset in addition to our base fine-tuned BART model. However, we did not use such a method in our final model, as none improved our results. Our final model for this dialogue task achieved scores only slightly below the top submission, with hidden test set scores of 49.62, 24.98, 46.25 and 91.54 for ROUGE-1, ROUGE-2, ROUGE-L and BERTScore respectively. The top submitted models will also receive human evaluation.

pdf (full)
bib (full)
Proceedings of the First Workshop on Natural Language Generation in Healthcare

pdf bib
Proceedings of the First Workshop on Natural Language Generation in Healthcare
Emiel Krahmer | Kathy McCoy | Ehud Reiter

pdf bib
DrivingBeacon: Driving Behaviour Change Support System Considering Mobile Use and Geo-information
Jawwad Baig | Guanyi Chen | Chenghua Lin | Ehud Reiter

Natural Language Generation has been proven effective and efficient in constructing health behaviour change support systems. We are working on DrivingBeacon, a behaviour change support system that uses telematics data from mobile phone sensors to generate weekly data-to-text feedback reports for vehicle drivers. The system makes use of a wealth of information, such as mobile phone use while driving, geo-information, speeding, and rush hour driving, to generate the feedback. We present results from a real-world evaluation where 8 drivers in the UK used DrivingBeacon for 4 weeks. Results are promising but not conclusive.
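
Data-to-text feedback of this kind is often produced with rules and templates; the sketch below is a deliberately simplified illustration with invented thresholds and wording, not DrivingBeacon's actual pipeline.

```python
# Rule- and template-based weekly feedback from telematics aggregates.
def weekly_feedback(trips: int, phone_use_pct: float, speeding_events: int) -> str:
    parts = [f"This week you made {trips} trips."]
    if phone_use_pct > 5.0:   # invented threshold
        parts.append(f"You used your phone during {phone_use_pct:.0f}% of your "
                     "driving time; try keeping it out of reach.")
    else:
        parts.append("Great job keeping your phone use low while driving.")
    if speeding_events:
        parts.append(f"We detected {speeding_events} speeding event(s); "
                     "slowing down would improve your score.")
    return " ".join(parts)

print(weekly_feedback(trips=14, phone_use_pct=8.2, speeding_events=2))
```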

pdf bib
In-Domain Pre-Training Improves Clinical Note Generation from Doctor-Patient Conversations
Colin Grambow | Longxiang Zhang | Thomas Schaaf

Summarization of doctor-patient conversations into clinical notes by medical scribes is an essential process for effective clinical care. Pre-trained transformer models have shown a great amount of success in this area, but the domain shift from standard NLP tasks to the medical domain continues to present challenges. We build upon several recent works to show that additional pre-training with in-domain medical conversations leads to performance gains for clinical summarization. In addition to conventional evaluation metrics, we also explore a clinical named entity recognition model for concept-based evaluation. Finally, we contrast long-sequence transformers with a common transformer model, BART. Overall, our findings corroborate research in non-medical domains and suggest that in-domain pre-training combined with transformers for long sequences are effective strategies for summarizing clinical encounters.

pdf
LCHQA-Summ: Multi-perspective Summarization of Publicly Sourced Consumer Health Answers
Abari Bhattacharya | Rochana Chaturvedi | Shweta Yadav

Community question answering forums provide a convenient platform for people to source answers to their questions including those related to healthcare from the general public. The answers to user queries are generally long and contain multiple different perspectives, redundancy or irrelevant answers. This presents a novel challenge for domain-specific concise and correct multi-answer summarization which we propose in this paper.

pdf
Towards Development of an Automated Health Coach
Leighanne Hsu | Rommy Marquez Hernandez | Kathleen McCoy | Keith Decker | Ajith Vemuri | Greg Dominick | Megan Heintzelman

Human health coaching has been established as an effective intervention for improving clients’ health, but it is restricted in scale due to the availability of coaches and finances of the clients. We aim to build a scalable, automated system for physical activity coaching that is similarly grounded in behavior change theories. In this paper, we present our initial steps toward building a flexible system that is capable of carrying out a natural dialogue for goal setting as a health coach would while also offering additional support through just-in-time adaptive interventions. We outline our modular system design and approach to gathering and analyzing data to incrementally implement such a system.

pdf
Personalizing Weekly Diet Reports
Elena Monfroglio | Lucas Anselma | Alessandro Mazzei

In this paper we present the main components of a weekly diet report generator (DRG) in natural language. The idea is to produce a text that contains information on the adherence of the dishes eaten during a week to the Mediterranean diet. The system is based on a user model, a database of the dishes eaten during the week, and the automatic computation of the Mediterranean Diet Score. All these sources of information are exploited to produce a highly personalized text. The system has two main goals, related to two different kinds of users: on the one hand, when used by dietitians, the main goal is to highlight the most salient medical information of the patient's diet; on the other hand, when used by final users, the main goal is to educate them toward a Mediterranean style of eating.
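
For illustration, a deliberately simplified, hypothetical adherence score in the spirit of the Mediterranean Diet Score; the real score uses different components and population-median thresholds.

```python
# Toy weekly adherence score: one point per food-group target met or limit respected.
WEEKLY_TARGETS = {   # hypothetical minimum servings per week
    "vegetables": 14, "fruit": 14, "legumes": 3, "fish": 2, "whole_grains": 7,
}
WEEKLY_LIMITS = {"red_meat": 2, "sweets": 3}   # hypothetical maximums

def diet_score(servings: dict[str, int]) -> int:
    score = sum(servings.get(g, 0) >= t for g, t in WEEKLY_TARGETS.items())
    score += sum(servings.get(g, 0) <= t for g, t in WEEKLY_LIMITS.items())
    return score   # 0 (poor adherence) .. 7 (high adherence)

week = {"vegetables": 16, "fruit": 10, "legumes": 3, "fish": 1,
        "whole_grains": 9, "red_meat": 1, "sweets": 4}
print(diet_score(week))  # -> 4
```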