Abstract
Research in multi-document summarization has focused on newswire corpora since the early beginnings. However, the newswire genre provides genre-specific features such as sentence position which are easy to exploit in summarization systems. Such easy to exploit genre-specific features are available in other genres as well. We therefore present the new hMDS corpus for multi-document summarization, which contains heterogeneous source documents from multiple text genres, as well as summaries with different lengths. For the construction of the corpus, we developed a novel construction approach which is suited to build large and heterogeneous summarization corpora with little effort. The method reverses the usual process of writing summaries for given source documents: it combines already available summaries with appropriate source documents. In a detailed analysis, we show that our new corpus is significantly different from the homogeneous corpora commonly used, and that it is heterogeneous along several dimensions. Our experimental evaluation using well-known state-of-the-art summarization systems shows that our corpus poses new challenges in the field of multi-document summarization. Last but not least, we make our corpus publicly available to the research community at the corpus web page https://github.com/AIPHES/hMDS.- Anthology ID:
- C16-1145
- Volume:
- Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers
- Month:
- December
- Year:
- 2016
- Address:
- Osaka, Japan
- Editors:
- Yuji Matsumoto, Rashmi Prasad
- Venue:
- COLING
- SIG:
- Publisher:
- The COLING 2016 Organizing Committee
- Note:
- Pages:
- 1535–1545
- Language:
- URL:
- https://aclanthology.org/C16-1145
- DOI:
- Cite (ACL):
- Markus Zopf, Maxime Peyrard, and Judith Eckle-Kohler. 2016. The Next Step for Multi-Document Summarization: A Heterogeneous Multi-Genre Corpus Built with a Novel Construction Approach. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1535–1545, Osaka, Japan. The COLING 2016 Organizing Committee.
- Cite (Informal):
- The Next Step for Multi-Document Summarization: A Heterogeneous Multi-Genre Corpus Built with a Novel Construction Approach (Zopf et al., COLING 2016)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-4/C16-1145.pdf
- Code
- AIPHES/hMDS