The CACAPO Dataset: A Multilingual, Multi-Domain Dataset for Neural Pipeline and End-to-End Data-to-Text Generation
Chris van der Lee, Chris Emmery, Sander Wubben, Emiel Krahmer
Abstract
This paper describes the CACAPO dataset, built for training both neural pipeline and end-to-end data-to-text language generation systems. The dataset is multilingual (Dutch and English) and contains almost 10,000 sentences from human-written news texts in the sports, weather, stocks, and incidents domains, together with aligned attribute-value paired data. The dataset is unique in that the linguistic variation and indirect ways of expressing data in these texts reflect the challenges of real-world NLG tasks.
- Anthology ID:
- 2020.inlg-1.10
- Volume:
- Proceedings of the 13th International Conference on Natural Language Generation
- Month:
- December
- Year:
- 2020
- Address:
- Dublin, Ireland
- Venue:
- INLG
- SIG:
- SIGGEN
- Publisher:
- Association for Computational Linguistics
- Pages:
- 68–79
- URL:
- https://aclanthology.org/2020.inlg-1.10
- Cite (ACL):
- Chris van der Lee, Chris Emmery, Sander Wubben, and Emiel Krahmer. 2020. The CACAPO Dataset: A Multilingual, Multi-Domain Dataset for Neural Pipeline and End-to-End Data-to-Text Generation. In Proceedings of the 13th International Conference on Natural Language Generation, pages 68–79, Dublin, Ireland. Association for Computational Linguistics.
- Cite (Informal):
- The CACAPO Dataset: A Multilingual, Multi-Domain Dataset for Neural Pipeline and End-to-End Data-to-Text Generation (van der Lee et al., INLG 2020)
- PDF:
- https://preview.aclanthology.org/paclic-22-ingestion/2020.inlg-1.10.pdf
- Data:
- RotoWire, WebNLG