Philippe Laban


Quiz Design Task: Helping Teachers Create Quizzes with Automated Question Generation
Philippe Laban | Chien-Sheng Wu | Lidiya Murakhovs’ka | Wenhao Liu | Caiming Xiong
Findings of the Association for Computational Linguistics: NAACL 2022

Question generation (QGen) models are often evaluated with standardized NLG metrics that are based on n-gram overlap.In this paper, we measure whether these metric improvements translate to gains in a practical setting, focusing on the use case of helping teachers automate the generation of reading comprehension quizzes. In our study, teachers building a quiz receive question suggestions, which they can either accept or refuse with a reason. Even though we find that recent progress in QGen leads to a significant increase in question acceptance rates, there is still large room for improvement, with the best model having only 68.4% of its questions accepted by the ten teachers who participated in our study. We then leverage the annotations we collected to analyze standard NLG metrics and find that model performance has reached projected upper-bounds, suggesting new automatic metrics are needed to guide QGen research forward.

MixQG: Neural Question Generation with Mixed Answer Types
Lidiya Murakhovs’ka | Chien-Sheng Wu | Philippe Laban | Tong Niu | Wenhao Liu | Caiming Xiong
Findings of the Association for Computational Linguistics: NAACL 2022

Asking good questions is an essential ability for both human and machine intelligence. However, existing neural question generation approaches mainly focus on short factoid type of answers. In this paper, we introduce a neural question generator, MixQG, to bridge this gap. We combine nine question answering datasets with diverse answer types, including yes/no, multiple-choice, extractive, and abstractive answers, to train a single generative model. We show with empirical results that our model outperforms existing work in both seen and unseen domains, and can generate questions with different cognitive levels when conditioned on different answer types. We run a human evaluation study to assess the quality of generated questions and find that MixQG outperforms the next best model by 10%. Our code and model checkpoints will be released and integrated with the HuggingFace library to facilitate various downstream applications.

Discord Questions: A Computational Approach To Diversity Analysis in News Coverage
Philippe Laban | Chien-Sheng Wu | Lidiya Murakhovs’ka | Xiang Chen | Caiming Xiong
Findings of the Association for Computational Linguistics: EMNLP 2022

There are many potential benefits to news readers accessing diverse sources. Modern news aggregators do the hard work of organizing the news, offering readers a plethora of source options, but choosing which source to read remains challenging.We propose a new framework to assist readers in identifying source differences and gaining an understanding of news coverage diversity.The framework is based on the generation of Discord Questions: questions with a diverse answer pool, explicitly illustrating source differences.To assemble a prototype of the framework, we focus on two components: (1) discord question generation, the task of generating questions answered differently by sources, for which we propose an automatic scoring method, and create a model that improves performance from current question generation (QG) methods by 5%, (2) answer consolidation, the task of grouping answers to a question that are semantically similar, for which we collect data and repurpose a method that achieves 81% balanced accuracy on our realistic test set.We illustrate the framework’s feasibility through a prototype interface. Even though model performance at discord QG still lags human performance by more than 15%, generated questions are judged to be more interesting than factoid questions and can reveal differences in the level of detail, sentiment, and reasoning of sources in news coverage. Code is available at

Near-Negative Distinction: Giving a Second Life to Human Evaluation Datasets
Philippe Laban | Chien-Sheng Wu | Wenhao Liu | Caiming Xiong
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Precisely assessing the progress in natural language generation (NLG) tasks is challenging, and human evaluation to establish a preference in a model’s output over another is often necessary.However, human evaluation is usually costly, difficult to reproduce, and non-reusable.In this paper, we propose a new and simple automatic evaluation method for NLG called Near-Negative Distinction (NND) that repurposes prior human annotations into NND tests.In an NND test, an NLG model must place a higher likelihood on a high-quality output candidate than on a near-negative candidate with a known error.Model performance is established by the number of NND tests a model passes, as well as the distribution over task-specific errors the model fails on.Through experiments on three NLG tasks (question generation, question answering, and summarization), we show that NND achieves a higher correlation with human judgments than standard NLG evaluation metrics. We then illustrate NND evaluation in four practical scenarios, for example performing fine-grain model analysis, or studying model training dynamics. Our findings suggest that NND can give a second life to human annotations and provide low-cost NLG evaluation.

SummaC: Re-Visiting NLI-based Models for Inconsistency Detection in Summarization
Philippe Laban | Tobias Schnabel | Paul N. Bennett | Marti A. Hearst
Transactions of the Association for Computational Linguistics, Volume 10

In the summarization domain, a key requirement for summaries is to be factually consistent with the input document. Previous work has found that natural language inference (NLI) models do not perform competitively when applied to inconsistency detection. In this work, we revisit the use of NLI for inconsistency detection, finding that past work suffered from a mismatch in input granularity between NLI datasets (sentence-level), and inconsistency detection (document level). We provide a highly effective and light-weight method called SummaCConv that enables NLI models to be successfully used for this task by segmenting documents into sentence units and aggregating scores between pairs of sentences. We furthermore introduce a new benchmark called SummaC (Summary Consistency) which consists of six large inconsistency detection datasets. On this dataset, SummaCConv obtains state-of-the-art results with a balanced accuracy of 74.4%, a 5% improvement compared with prior work.


Keep It Simple: Unsupervised Simplification of Multi-Paragraph Text
Philippe Laban | Tobias Schnabel | Paul Bennett | Marti A. Hearst
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

This work presents Keep it Simple (KiS), a new approach to unsupervised text simplification which learns to balance a reward across three properties: fluency, salience and simplicity. We train the model with a novel algorithm to optimize the reward (k-SCST), in which the model proposes several candidate simplifications, computes each candidate’s reward, and encourages candidates that outperform the mean reward. Finally, we propose a realistic text comprehension task as an evaluation method for text simplification. When tested on the English news domain, the KiS model outperforms strong supervised baselines by more than 4 SARI points, and can help people complete a comprehension task an average of 18% faster while retaining accuracy, when compared to the original text.

Can Transformer Models Measure Coherence In Text: Re-Thinking the Shuffle Test
Philippe Laban | Luke Dai | Lucas Bandarkar | Marti A. Hearst
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

The Shuffle Test is the most common task to evaluate whether NLP models can measure coherence in text. Most recent work uses direct supervision on the task; we show that by simply finetuning a RoBERTa model, we can achieve a near perfect accuracy of 97.8%, a state-of-the-art. We argue that this outstanding performance is unlikely to lead to a good model of text coherence, and suggest that the Shuffle Test should be approached in a Zero-Shot setting: models should be evaluated without being trained on the task itself. We evaluate common models in this setting, such as Generative and Bi-directional Transformers, and find that larger architectures achieve high-performance out-of-the-box. Finally, we suggest the k-Block Shuffle Test, a modification of the original by increasing the size of blocks shuffled. Even though human reader performance remains high (around 95% accuracy), model performance drops from 94% to 78% as block size increases, creating a conceptually simple challenge to benchmark NLP models.

News Headline Grouping as a Challenging NLU Task
Philippe Laban | Lucas Bandarkar | Marti A. Hearst
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Recent progress in Natural Language Understanding (NLU) has seen the latest models outperform human performance on many standard tasks. These impressive results have led the community to introspect on dataset limitations, and iterate on more nuanced challenges. In this paper, we introduce the task of HeadLine Grouping (HLG) and a corresponding dataset (HLGD) consisting of 20,056 pairs of news headlines, each labeled with a binary judgement as to whether the pair belongs within the same group. On HLGD, human annotators achieve high performance of around 0.9 F-1, while current state-of-the art Transformer models only reach 0.75 F-1, opening the path for further improvements. We further propose a novel unsupervised Headline Generator Swap model for the task of HeadLine Grouping that achieves within 3 F-1 of the best supervised model. Finally, we analyze high-performing models with consistency tests, and find that models are not consistent in their predictions, revealing modeling limits of current architectures.


The Summary Loop: Learning to Write Abstractive Summaries Without Examples
Philippe Laban | Andrew Hsi | John Canny | Marti A. Hearst
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

This work presents a new approach to unsupervised abstractive summarization based on maximizing a combination of coverage and fluency for a given length constraint. It introduces a novel method that encourages the inclusion of key terms from the original document into the summary: key terms are masked out of the original document and must be filled in by a coverage model using the current generated summary. A novel unsupervised training procedure leverages this coverage model along with a fluency model to generate and score summaries. When tested on popular news summarization datasets, the method outperforms previous unsupervised methods by more than 2 R-1 points, and approaches results of competitive supervised methods. Our model attains higher levels of abstraction with copied passages roughly two times shorter than prior work, and learns to compress and merge sentences without supervision.

What’s The Latest? A Question-driven News Chatbot
Philippe Laban | John Canny | Marti A. Hearst
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

This work describes an automatic news chatbot that draws content from a diverse set of news articles and creates conversations with a user about the news. Key components of the system include the automatic organization of news articles into topical chatrooms, integration of automatically generated questions into the conversation, and a novel method for choosing which questions to present which avoids repetitive suggestions. We describe the algorithmic framework and present the results of a usability study that shows that news readers using the system successfully engage in multi-turn conversations about specific news stories.


pdf bib
newsLens: building and visualizing long-ranging news stories
Philippe Laban | Marti Hearst
Proceedings of the Events and Stories in the News Workshop

We propose a method to aggregate and organize a large, multi-source dataset of news articles into a collection of major stories, and automatically name and visualize these stories in a working system. The approach is able to run online, as new articles are added, processing 4 million news articles from 20 news sources, and extracting 80000 major stories, some of which span several years. The visual interface consists of lanes of timelines, each annotated with information that is deemed important for the story, including extracted quotations. The working system allows a user to search and navigate 8 years of story information.