Markus Dreyer


Evaluating the Tradeoff Between Abstractiveness and Factuality in Abstractive Summarization
Markus Dreyer | Mengwen Liu | Feng Nan | Sandeep Atluri | Sujith Ravi
Findings of the Association for Computational Linguistics: EACL 2023

Neural models for abstractive summarization tend to generate output that is fluent and well-formed but lacks semantic faithfulness, or factuality, with respect to the input documents. In this paper, we analyze the tradeoff between abstractiveness and factuality of generated summaries across multiple datasets and models, using extensive human evaluations of factuality. In our analysis, we visualize the rates of change in factuality as we gradually increase abstractiveness using a decoding constraint, and we observe that, while increased abstractiveness generally leads to a drop in factuality, the rate of factuality decay depends on factors such as the data that the system was trained on. We introduce two datasets with human factuality judgements: one containing 10.2k generated summaries with systematically varied degrees of abstractiveness, the other containing 4.2k summaries from five different summarization models. We propose new factuality metrics that adjust for the degree of abstractiveness, and we use them to compare the abstractiveness-adjusted factuality of previous summarization works, providing baselines for future work.

Enhancing Multi-Document Summarization with Cross-Document Graph-based Information Extraction
Zixuan Zhang | Heba Elfardy | Markus Dreyer | Kevin Small | Heng Ji | Mohit Bansal
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Information extraction (IE) and summarization are closely related, both tasked with presenting a subset of the information contained in a natural language text. However, while IE extracts structural representations, summarization aims to abstract the most salient information into a generated text summary – thus potentially encountering the technical limitations of current text generation methods (e.g., hallucination). To mitigate this risk, this work uses structured IE graphs to enhance the abstractive summarization task. Specifically, we focus on improving Multi-Document Summarization (MDS) performance by using cross-document IE output, incorporating two novel components: (1) the use of auxiliary entity and event recognition systems to focus the summary generation model; (2) incorporating an alignment loss between IE nodes and their text spans to reduce inconsistencies between the IE graphs and text representations. Operationally, both the IE nodes and corresponding text spans are projected into the same embedding space and pairwise distance is minimized. Experimental results on multiple MDS benchmarks show that summaries generated by our model are more factually consistent with the source documents than baseline models while maintaining the same level of abstractiveness.
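
The alignment idea described above (project IE nodes and their text spans into one space, then minimize pairwise distance) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function names, the use of plain linear projections, and the choice of Euclidean distance are not taken from the paper, which the abstract does not specify at this level of detail.

```python
import math

def project(vec, weights):
    """Apply a linear projection; `weights` is a list of rows (hypothetical helper)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def alignment_loss(node_vecs, span_vecs, node_proj, span_proj):
    """Mean Euclidean distance between each projected node and its paired span.

    Assumption: node_vecs[i] and span_vecs[i] form an aligned pair.
    """
    total = 0.0
    for node, span in zip(node_vecs, span_vecs):
        p = project(node, node_proj)   # node in the shared space
        q = project(span, span_proj)   # span in the shared space
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    return total / len(node_vecs)

# Toy usage with identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
nodes = [[1.0, 0.0], [0.0, 0.0]]
spans = [[1.0, 0.0], [3.0, 4.0]]
loss = alignment_loss(nodes, spans, identity, identity)
```

In a trained model such a term would presumably be added to the summarization loss and minimized by gradient descent; the sketch only shows how the quantity itself might be computed.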

Faithfulness-Aware Decoding Strategies for Abstractive Summarization
David Wan | Mengwen Liu | Kathleen McKeown | Markus Dreyer | Mohit Bansal
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Despite significant progress in understanding and improving faithfulness in abstractive summarization, the question of how decoding strategies affect faithfulness is less studied. We present a systematic study of the effect of generation techniques such as beam search and nucleus sampling on faithfulness in abstractive summarization. We find a consistent trend where beam search with large beam sizes produces the most faithful summaries while nucleus sampling generates the least faithful ones. We propose two faithfulness-aware generation methods to further improve faithfulness over current generation techniques: (1) ranking candidates generated by beam search using automatic faithfulness metrics and (2) incorporating lookahead heuristics that produce a faithfulness score on the future summary. We show that both generation methods significantly improve faithfulness across two datasets as evaluated by four automatic faithfulness metrics and human evaluation. To reduce computational cost, we demonstrate a simple distillation approach that allows the model to generate faithful summaries with just greedy decoding.
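
The first method mentioned above, ranking beam-search candidates with an automatic faithfulness metric, can be sketched as follows. This is a rough illustration under stated assumptions: `rerank_by_faithfulness` and `token_overlap_score` are hypothetical names, and the token-overlap scorer is a toy stand-in, not one of the faithfulness metrics used in the paper.

```python
def token_overlap_score(source, summary):
    """Toy stand-in metric: fraction of summary tokens that also occur in the source."""
    source_tokens = set(source.lower().split())
    summary_tokens = summary.lower().split()
    if not summary_tokens:
        return 0.0
    return sum(tok in source_tokens for tok in summary_tokens) / len(summary_tokens)

def rerank_by_faithfulness(source, candidates, faithfulness_score):
    """Return the candidate with the highest score under the given metric."""
    return max(candidates, key=lambda cand: faithfulness_score(source, cand))

# Toy usage: the candidates stand in for the output of beam search.
source = "the cat sat on the mat"
candidates = ["the dog sat on the mat", "the cat sat"]
best = rerank_by_faithfulness(source, candidates, token_overlap_score)
```

Passing the metric in as a function keeps the reranking step independent of any particular scorer, so a learned faithfulness metric could be swapped in without changing the selection logic.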


FactGraph: Evaluating Factuality in Summarization with Semantic Graph Representations
Leonardo F. R. Ribeiro | Mengwen Liu | Iryna Gurevych | Markus Dreyer | Mohit Bansal
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Despite recent improvements in abstractive summarization, most current approaches generate summaries that are not factually consistent with the source document, severely restricting their trust and usage in real-world applications. Recent works have shown promising improvements in factuality error identification using text or dependency arc entailments; however, they do not consider the entire semantic graph simultaneously. To this end, we propose FactGraph, a method that decomposes the document and the summary into structured meaning representations (MR), which are more suitable for factuality evaluation. MRs describe core semantic concepts and their relations, aggregating the main content in both document and summary in a canonical form, and reducing data sparsity. FactGraph encodes such graphs using a graph encoder augmented with structure-aware adapters to capture interactions among the concepts based on the graph connectivity, along with text representations using an adapter-based text encoder. Experiments on different benchmarks for evaluating factuality show that FactGraph outperforms previous approaches by up to 15%. Furthermore, FactGraph improves performance on identifying content verifiability errors and better captures subsentence-level factual inconsistencies.

Efficient Few-Shot Fine-Tuning for Opinion Summarization
Arthur Brazinskas | Ramesh Nallapati | Mohit Bansal | Markus Dreyer
Findings of the Association for Computational Linguistics: NAACL 2022

Abstractive summarization models are typically pre-trained on large amounts of generic texts, then fine-tuned on tens or hundreds of thousands of annotated samples. However, in opinion summarization, large annotated datasets of reviews paired with reference summaries are not available and would be expensive to create. This calls for fine-tuning methods robust to overfitting on small datasets. In addition, generically pre-trained models are often not accustomed to the specifics of customer reviews and, after fine-tuning, yield summaries with disfluencies and semantic mistakes. To address these problems, we utilize an efficient few-shot method based on adapters which, as we show, can easily store in-domain knowledge. Instead of fine-tuning the entire model, we add adapters and pre-train them in a task-specific way on a large corpus of unannotated customer reviews, using held-out reviews as pseudo summaries. Then, we fine-tune the adapters on the small available human-annotated dataset. We show that this self-supervised adapter pre-training improves summary quality over standard fine-tuning by 2.0 and 1.3 ROUGE-L points on the Amazon and Yelp datasets, respectively. Finally, for summary personalization, we condition on aspect keyword queries, automatically created from generic datasets. In the same vein, we pre-train the adapters in a query-based manner on customer reviews and then fine-tune them on annotated datasets. This results in better-organized summary content reflected in improved coherence and fewer redundancies.


Rewards with Negative Examples for Reinforced Topic-Focused Abstractive Summarization
Khalil Mrini | Can Liu | Markus Dreyer
Proceedings of the Third Workshop on New Frontiers in Summarization

We consider the problem of topic-focused abstractive summarization, where the goal is to generate an abstractive summary focused on a particular topic, a phrase of one or multiple words. We hypothesize that the task of generating topic-focused summaries can be improved by showing the model what it must not focus on. We introduce a deep reinforcement learning approach to topic-focused abstractive summarization, trained on rewards with a novel negative example baseline. We define the input in this problem as the source text preceded by the topic. We adapt the CNN-Daily Mail and New York Times summarization datasets for this task. We then show through experiments on existing rewards that the use of a negative example baseline can outperform the use of a self-critical baseline, in Rouge, BERTScore, and human evaluation metrics.
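
The role of a baseline in reward-based training can be illustrated with the standard policy-gradient loss with a baseline, where the reward of the sampled summary is compared against a baseline reward. The sketch below is an assumption-laden illustration: the function name and the numbers are hypothetical, and how the negative example is constructed and scored is not stated in the abstract, so its reward is simply passed in as a value.

```python
def policy_gradient_loss(sample_log_prob, sample_reward, baseline_reward):
    """REINFORCE-style loss with a baseline.

    A self-critical baseline would pass in the reward of the model's own
    greedy output; a negative example baseline (as assumed here) would pass
    in the reward obtained by a negative example instead.
    """
    advantage = sample_reward - baseline_reward
    return -advantage * sample_log_prob

# Toy usage with made-up numbers.
loss = policy_gradient_loss(sample_log_prob=-2.0,
                            sample_reward=0.6,
                            baseline_reward=0.4)
```

When the sampled summary scores higher than the baseline, minimizing this loss raises the sample's log-probability; when it scores lower, the sample is pushed down.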

Efficiently Summarizing Text and Graph Encodings of Multi-Document Clusters
Ramakanth Pasunuru | Mengwen Liu | Mohit Bansal | Sujith Ravi | Markus Dreyer
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

This paper presents an efficient graph-enhanced approach to multi-document summarization (MDS) with an encoder-decoder Transformer model. This model is based on recent advances in pre-training both encoder and decoder on very large text data (Lewis et al., 2019), and it incorporates an efficient encoding mechanism (Beltagy et al., 2020) that avoids the quadratic memory growth typical for traditional Transformers. We show that this powerful combination not only scales to large input documents commonly found when summarizing news clusters; it also enables us to process additional input in the form of auxiliary graph representations, which we derive from the multi-document clusters. We present a mechanism to incorporate such graph information into the encoder-decoder model that was pre-trained on text only. Our approach leads to significant improvements on the Multi-News dataset, overall leading to an average 1.8 ROUGE score improvement over previous work (Li et al., 2020). We also show improvements in a transfer-only setup on the DUC-2004 dataset. The graph encodings lead to summaries that are more abstractive. Human evaluation shows that they are also more informative and factually more consistent with their input documents.


Multi-Task Networks with Universe, Group, and Task Feature Learning
Shiva Pentyala | Mengwen Liu | Markus Dreyer
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We present methods for multi-task learning that take advantage of natural groupings of related tasks. Task groups may be defined along known properties of the tasks, such as task domain or language. Such task groups represent supervised information at the inter-task level and can be encoded into the model. We investigate two variants of neural network architectures that accomplish this, learning different feature spaces at the levels of individual tasks, task groups, as well as the universe of all tasks: (1) parallel architectures encode each input simultaneously into feature spaces at different levels; (2) serial architectures encode each input successively into feature spaces at different levels in the task hierarchy. We demonstrate the methods on natural language understanding (NLU) tasks, where a grouping of tasks into different task domains leads to improved performance on ATIS, Snips, and a large in-house dataset.


Transfer Learning for Neural Semantic Parsing
Xing Fan | Emilio Monti | Lambert Mathias | Markus Dreyer
Proceedings of the 2nd Workshop on Representation Learning for NLP

The goal of semantic parsing is to map natural language to a machine interpretable meaning representation language (MRL). One of the constraints that limits full exploration of deep learning technologies for semantic parsing is the lack of sufficient annotated training data. In this paper, we propose using sequence-to-sequence in a multi-task setup for semantic parsing with focus on transfer learning. We explore three multi-task architectures for sequence-to-sequence models and compare their performance with the independently trained model. Our experiments show that the multi-task setup aids transfer learning from an auxiliary task with large labeled data to the target task with smaller labeled data. We see an absolute accuracy gain ranging from 1.0% to 4.4% in our in-house data set and we also see good gains ranging from 2.5% to 7.0% on the ATIS semantic parsing tasks with syntactic and semantic auxiliary tasks.


APRO: All-Pairs Ranking Optimization for MT Tuning
Markus Dreyer | Yuanzhe Dong
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

hyp: A Toolkit for Representing, Manipulating, and Optimizing Hypergraphs
Markus Dreyer | Jonathan Graehl
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations


HyTER: Meaning-Equivalent Semantics for Translation Evaluation
Markus Dreyer | Daniel Marcu
Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies


Discovering Morphological Paradigms from Plain Text Using a Dirichlet Process Mixture Model
Markus Dreyer | Jason Eisner
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing


Graphical Models over Multiple Strings
Markus Dreyer | Jason Eisner
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing


Machine Translation System Combination using ITG-based Alignments
Damianos Karakos | Jason Eisner | Sanjeev Khudanpur | Markus Dreyer
Proceedings of ACL-08: HLT, Short Papers

Latent-Variable Modeling of String Transductions with Finite-State Methods
Markus Dreyer | Jason Smith | Jason Eisner
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing


Comparing Reordering Constraints for SMT Using Efficient BLEU Oracle Computation
Markus Dreyer | Keith Hall | Sanjeev Khudanpur
Proceedings of SSST, NAACL-HLT 2007 / AMTA Workshop on Syntax and Structure in Statistical Translation


Better Informed Training of Latent Syntactic Features
Markus Dreyer | Jason Eisner
Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing

Vine Parsing and Minimum Risk Reranking for Speed and Precision
Markus Dreyer | David A. Smith | Noah A. Smith
Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X)