MLSUM: The Multilingual Summarization Corpus
Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano
Abstract
We present MLSUM, the first large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five different languages – namely, French, German, Spanish, Russian, Turkish. Together with English news articles from the popular CNN/Daily mail dataset, the collected data form a large scale multilingual dataset which can enable new research directions for the text summarization community. We report cross-lingual comparative analyses based on state-of-the-art systems. These highlight existing biases which motivate the use of a multi-lingual dataset.- Anthology ID:
- 2020.emnlp-main.647
- Volume:
- Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
- Month:
- November
- Year:
- 2020
- Address:
- Online
- Venue:
- EMNLP
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 8051–8067
- Language:
- URL:
- https://aclanthology.org/2020.emnlp-main.647
- DOI:
- 10.18653/v1/2020.emnlp-main.647
- Cite (ACL):
- Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, and Jacopo Staiano. 2020. MLSUM: The Multilingual Summarization Corpus. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8051–8067, Online. Association for Computational Linguistics.
- Cite (Informal):
- MLSUM: The Multilingual Summarization Corpus (Scialom et al., EMNLP 2020)
- PDF:
- https://preview.aclanthology.org/paclic-22-ingestion/2020.emnlp-main.647.pdf
- Data
- MLSUM, BigPatent, CNN/Daily Mail, LCSTS, MLQA, NEWSROOM, New York Times Annotated Corpus, SNLI