Automatically Discarding Straplines to Improve Data Quality for Abstractive News Summarization
Amr Keleg, Matthias Lindemann, Danyang Liu, Wanqiu Long, Bonnie L. Webber
Abstract
Recent improvements in automatic news summarization fundamentally rely on large corpora of news articles and their summaries. These corpora are often constructed by scraping news websites, which results in including not only summaries but also other kinds of texts. Apart from more generic noise, we identify straplines as a form of text scraped from news websites that commonly turn out not to be summaries. The presence of these non-summaries threatens the validity of scraped corpora as benchmarks for news summarization. We have annotated extracts from two news sources that form part of the Newsroom corpus (Grusky et al., 2018), labeling those which were straplines, those which were summaries, and those which were both. We present a rule-based strapline detection method that achieves good performance on a manually annotated test set. Automatic evaluation indicates that removing straplines and noise from the training data of a news summarizer results in higher quality summaries, with improvements as high as 7 points ROUGE score.
- Anthology ID: 2022.nlppower-1.5
- Volume: Proceedings of NLP Power! The First Workshop on Efficient Benchmarking in NLP
- Month: May
- Year: 2022
- Address: Dublin, Ireland
- Editors: Tatiana Shavrina, Vladislav Mikhailov, Valentin Malykh, Ekaterina Artemova, Oleg Serikov, Vitaly Protasov
- Venue: nlppower
- Publisher: Association for Computational Linguistics
- Pages: 42–51
- URL: https://aclanthology.org/2022.nlppower-1.5
- DOI: 10.18653/v1/2022.nlppower-1.5
- Cite (ACL): Amr Keleg, Matthias Lindemann, Danyang Liu, Wanqiu Long, and Bonnie L. Webber. 2022. Automatically Discarding Straplines to Improve Data Quality for Abstractive News Summarization. In Proceedings of NLP Power! The First Workshop on Efficient Benchmarking in NLP, pages 42–51, Dublin, Ireland. Association for Computational Linguistics.
- Cite (Informal): Automatically Discarding Straplines to Improve Data Quality for Abstractive News Summarization (Keleg et al., nlppower 2022)
- PDF: https://preview.aclanthology.org/nschneid-patch-3/2022.nlppower-1.5.pdf
- Data: CNN/Daily Mail, NEWSROOM