CATAMARAN: A Cross-lingual Long Text Abstractive Summarization Dataset

Zheng Chen, Hongyu Lin


Abstract
Cross-lingual summarization, which produces a summary in one language from a source document in another, could help people obtain information from across the world. However, the task remains little explored because datasets are lacking. Recent studies rely primarily on pseudo-cross-lingual datasets obtained by translation. This approach inevitably loses information from the original document and introduces noise into the summary, hurting overall performance. In this paper, we present CATAMARAN, the first high-quality cross-lingual long text abstractive summarization dataset. It contains about 20,000 parallel news articles and corresponding summaries, all written by humans. The average article length is 1133.65 for English and 2035.33 for Chinese, and the average summary lengths are 26.59 and 70.05, respectively. We train and evaluate an mBART-based cross-lingual abstractive summarization model on our dataset. The results show that, compared with monolingual systems, the cross-lingual abstractive summarization system can also achieve solid performance.
Anthology ID:
2022.lrec-1.749
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
6932–6937
URL:
https://aclanthology.org/2022.lrec-1.749
Cite (ACL):
Zheng Chen and Hongyu Lin. 2022. CATAMARAN: A Cross-lingual Long Text Abstractive Summarization Dataset. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 6932–6937, Marseille, France. European Language Resources Association.
Cite (Informal):
CATAMARAN: A Cross-lingual Long Text Abstractive Summarization Dataset (Chen & Lin, LREC 2022)
PDF:
https://preview.aclanthology.org/nschneid-patch-1/2022.lrec-1.749.pdf