Abstract
Large-scale datasets have recently facilitated progress in nearly all domains of Natural Language Processing. However, there is currently no cross-task dataset in NLP, which hinders the development of multi-task learning. We propose MATINF, the first jointly labeled large-scale dataset for classification, question answering and summarization. MATINF contains 1.07 million question-answer pairs with human-labeled categories and user-generated question descriptions. With this rich information, MATINF is applicable to three major NLP tasks: classification, question answering, and summarization. We benchmark existing methods and a novel multi-task baseline on MATINF to inspire further research. Our comprehensive comparison and experiments on MATINF and other datasets demonstrate the merits of MATINF.

- Anthology ID: 2020.acl-main.330
- Volume: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
- Month: July
- Year: 2020
- Address: Online
- Editors: Dan Jurafsky, Joyce Chai, Natalie Schluter, Joel Tetreault
- Venue: ACL
- Publisher: Association for Computational Linguistics
- Pages: 3586–3596
- URL: https://aclanthology.org/2020.acl-main.330
- DOI: 10.18653/v1/2020.acl-main.330
- Cite (ACL): Canwen Xu, Jiaxin Pei, Hongtao Wu, Yiyu Liu, and Chenliang Li. 2020. MATINF: A Jointly Labeled Large-Scale Dataset for Classification, Question Answering and Summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3586–3596, Online. Association for Computational Linguistics.
- Cite (Informal): MATINF: A Jointly Labeled Large-Scale Dataset for Classification, Question Answering and Summarization (Xu et al., ACL 2020)
- PDF: https://preview.aclanthology.org/ingest-bitext-workshop/2020.acl-main.330.pdf
- Code: WHUIR/MATINF
- Data: MATINF, AG News, DuReader, LCSTS, MS MARCO, NEWSROOM