Abstract
Most earlier work on text summarization is carried out on news article datasets. The summary in these datasets is naturally located at the beginning of the text. Hence, a model can spuriously utilize this correlation for summary generation instead of truly learning to summarize. To address this issue, we constructed a new dataset, SumPubMed, using scientific articles from the PubMed archive. We conducted a human analysis of summary coverage, redundancy, readability, coherence, and informativeness on SumPubMed. SumPubMed is challenging because (a) the summary is distributed throughout the text (not localized at the top), and (b) it contains rare domain-specific scientific terms. We observe that seq2seq models that adequately summarize news articles struggle to summarize SumPubMed. Thus, SumPubMed opens new avenues for the future improvement of models as well as the development of new evaluation metrics.
- Anthology ID:
- 2021.acl-srw.30
- Volume:
- Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop
- Month:
- August
- Year:
- 2021
- Address:
- Online
- Editors:
- Jad Kabbara, Haitao Lin, Amandalynne Paullada, Jannis Vamvas
- Venues:
- ACL | IJCNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 292–303
- URL:
- https://preview.aclanthology.org/icon-24-ingestion/2021.acl-srw.30/
- DOI:
- 10.18653/v1/2021.acl-srw.30
- Cite (ACL):
- Vivek Gupta, Prerna Bharti, Pegah Nokhiz, and Harish Karnick. 2021. SumPubMed: Summarization Dataset of PubMed Scientific Articles. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop, pages 292–303, Online. Association for Computational Linguistics.
- Cite (Informal):
- SumPubMed: Summarization Dataset of PubMed Scientific Articles (Gupta et al., ACL-IJCNLP 2021)
- PDF:
- https://preview.aclanthology.org/icon-24-ingestion/2021.acl-srw.30.pdf
- Code:
- vgupta123/sumpubmed
- Data:
- CNN/Daily Mail