SumPubMed: Summarization Dataset of PubMed Scientific Articles

Vivek Gupta, Prerna Bharti, Pegah Nokhiz, Harish Karnick


Abstract
Most earlier work on text summarization has been carried out on news article datasets. The summary in these datasets is naturally located at the beginning of the text. Hence, a model can spuriously exploit this correlation for summary generation instead of truly learning to summarize. To address this issue, we constructed a new dataset, SumPubMed, using scientific articles from the PubMed archive. We conducted a human analysis of summary coverage, redundancy, readability, coherence, and informativeness on SumPubMed. SumPubMed is challenging because (a) the summary is distributed throughout the text rather than localized at the top, and (b) it contains rare domain-specific scientific terms. We observe that seq2seq models that adequately summarize news articles struggle to summarize SumPubMed. Thus, SumPubMed opens new avenues for the future improvement of models as well as the development of new evaluation metrics.
Anthology ID:
2021.acl-srw.30
Volume:
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop
Month:
August
Year:
2021
Address:
Online
Editors:
Jad Kabbara, Haitao Lin, Amandalynne Paullada, Jannis Vamvas
Venues:
ACL | IJCNLP
Publisher:
Association for Computational Linguistics
Pages:
292–303
URL:
https://preview.aclanthology.org/icon-24-ingestion/2021.acl-srw.30/
DOI:
10.18653/v1/2021.acl-srw.30
Cite (ACL):
Vivek Gupta, Prerna Bharti, Pegah Nokhiz, and Harish Karnick. 2021. SumPubMed: Summarization Dataset of PubMed Scientific Articles. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop, pages 292–303, Online. Association for Computational Linguistics.
Cite (Informal):
SumPubMed: Summarization Dataset of PubMed Scientific Articles (Gupta et al., ACL-IJCNLP 2021)
PDF:
https://preview.aclanthology.org/icon-24-ingestion/2021.acl-srw.30.pdf
Video:
https://preview.aclanthology.org/icon-24-ingestion/2021.acl-srw.30.mp4
Code:
vgupta123/sumpubmed
Data:
CNN/Daily Mail