Recent work has introduced an important yet relatively under-explored NLP task called Semantic Overlap Summarization (SOS) that entails generating a summary from multiple alternative narratives which conveys the common information provided by those narratives. Previous work also published a benchmark dataset for this task by collecting 2,925 alternative narrative pairs from the web and manually annotating 411 different reference summaries by engaging human annotators. In this paper, we exclusively focus on the automated evaluation of the SOS task using the benchmark dataset. More specifically, we first use the popular ROUGE metric from text-summarization literature and conduct a systematic study to evaluate the SOS task. Our experiments discover that ROUGE is not suitable for this novel task and therefore, we propose a new sentence-level precision-recall style automated evaluation metric, called SEM-F1 (Semantic F1). It is inspired by the benefits of the sentence-wise annotation technique using overlap labels reported by the previous work. Our experiments show that the proposed SEM-F1 metric yields a higher correlation with human judgment and higher inter-rater agreement compared to the ROUGE metric.
Semantic Overlap Summarization (SOS) is a novel and relatively under-explored seq-to-seq task which entails summarizing common information from multiple alternate narratives. One of the major challenges for solving this task is the lack of existing datasets for supervised training. To address this challenge, we propose a novel data augmentation technique, which allows us to create large amount of synthetic data for training a seq-to-seq model that can perform the SOS task. Through extensive experiments using narratives from the news domain, we show that the models fine-tuned using the synthetic dataset provide significant performance improvements over the pre-trained vanilla summarization techniques and are close to the models fine-tuned on the golden training data; which essentially demonstrates the effectiveness of out proposed data augmentation technique for training seq-to-seq models on the SOS task.
In this paper, we introduce an important yet relatively unexplored NLP task called Semantic Overlap Summarization (SOS), which entails generating a single summary from multiple alternative narratives which can convey the common information provided by those narratives. As no benchmark dataset is readily available for this task, we created one by collecting 2,925 alternative narrative pairs from the web and then, went through the tedious process of manually creating 411 different reference summaries by engaging human annotators. As a way to evaluate this novel task, we first conducted a systematic study by borrowing the popular ROUGE metric from text-summarization literature and discovered that ROUGE is not suitable for our task. Subsequently, we conducted further human annotations to create 200 document-level and 1,518 sentence-level ground-truth overlap labels. Our experiments show that the sentence-wise annotation technique with three overlap labels, i.e., Absent (A), Partially-Present (PP), and Present (P), yields a higher correlation with human judgment and higher inter-rater agreement compared to the ROUGE metric.
It has been the norm for a long time to evaluate automated summarization tasks using the popular ROUGE metric. Although several studies in the past have highlighted the limitations of ROUGE, researchers have struggled to reach a consensus on a better alternative until today. One major limitation of the traditional ROUGE metric is the lack of semantic understanding (relies on direct overlap of n-grams). In this paper, we exclusively focus on the extractive summarization task and propose a semantic-aware nCG (normalized cumulative gain)-based evaluation metric (called Sem-nCG) for evaluating this task. One fundamental contribution of the paper is that it demonstrates how we can generate more reliable semantic-aware ground truths for evaluating extractive summarization tasks without any additional human intervention. To the best of our knowledge, this work is the first of its kind. We have conducted extensive experiments with this new metric using the widely used CNN/DailyMail dataset. Experimental results show that the new Sem-nCG metric is indeed semantic-aware, shows higher correlation with human judgement (more reliable) and yields a large number of disagreements with the original ROUGE metric (suggesting that ROUGE often leads to inaccurate conclusions also verified by humans).