Multi-Document Summarization with Centroid-Based Pretraining
Ratish Surendran Puduppully, Parag Jain, Nancy Chen, Mark Steedman
Abstract
In Multi-Document Summarization (MDS), the input is a set of documents, and the output is their summary. In this paper, we focus on pretraining objectives for MDS. Specifically, we introduce a novel pretraining objective, which involves selecting the ROUGE-based centroid of each document cluster as a proxy for its summary. Our objective thus does not require human-written summaries and can be utilized for pretraining on a dataset consisting solely of document sets. Through zero-shot, few-shot, and fully supervised experiments on multiple MDS datasets, we show that our model, Centrum, performs better than or comparably to a state-of-the-art model. We make the pretrained and fine-tuned models freely available to the research community (https://github.com/ratishsp/centrum).
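To make the objective concrete, below is a minimal Python sketch of ROUGE-based centroid selection, assuming the `rouge-score` package. The abstract does not specify which ROUGE variants are used or how they are combined, so averaging ROUGE-1/2/L F1 here is an illustrative assumption, not the paper's exact recipe.

```python
from rouge_score import rouge_scorer  # pip install rouge-score

def select_centroid(documents):
    """Return the index of the cluster's ROUGE centroid: the document whose
    average ROUGE F1 against all other cluster members is highest. In
    Centrum-style pretraining, this centroid serves as the pseudo-summary."""
    assert len(documents) > 1, "a cluster needs at least two documents"
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                      use_stemmer=True)
    best_idx, best_score = -1, float("-inf")
    for i, candidate in enumerate(documents):
        total = 0.0
        for j, reference in enumerate(documents):
            if i == j:
                continue
            scores = scorer.score(reference, candidate)
            # Mean F1 over ROUGE-1/2/L -- an assumed aggregation; the paper
            # may weight or select ROUGE variants differently.
            total += sum(s.fmeasure for s in scores.values()) / len(scores)
        avg = total / (len(documents) - 1)
        if avg > best_score:
            best_idx, best_score = i, avg
    return best_idx

# Usage: the centroid becomes the pretraining target, and the remaining
# documents in the cluster form the model input.
cluster = [
    "The storm hit the coast on Monday, causing floods.",
    "A coastal storm on Monday led to widespread flooding.",
    "Officials opened shelters after Monday's storm flooded the coast.",
]
centroid = select_centroid(cluster)
print(centroid, cluster[centroid])
```

Because the centroid is drawn from the cluster itself, no human-written summaries are needed; any corpus of document sets can supply pretraining labels this way.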
- Anthology ID: 2023.acl-short.13
- Volume: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
- Month: July
- Year: 2023
- Address: Toronto, Canada
- Venue: ACL
- Publisher: Association for Computational Linguistics
- Pages: 128–138
- URL: https://aclanthology.org/2023.acl-short.13
- Cite (ACL): Ratish Surendran Puduppully, Parag Jain, Nancy Chen, and Mark Steedman. 2023. Multi-Document Summarization with Centroid-Based Pretraining. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 128–138, Toronto, Canada. Association for Computational Linguistics.
- Cite (Informal): Multi-Document Summarization with Centroid-Based Pretraining (Puduppully et al., ACL 2023)
- PDF: https://aclanthology.org/2023.acl-short.13.pdf