Self-Teaching Machines to Read and Comprehend with Large-Scale Multi-Subject Question-Answering Data
Abstract
Despite considerable progress, most machine reading comprehension (MRC) tasks still lack sufficient training data to fully exploit powerful deep neural network models with millions of parameters, and it is laborious, expensive, and time-consuming to create large-scale, high-quality MRC data through crowdsourcing. This paper focuses on generating more training data for MRC tasks by leveraging existing question-answering (QA) data. We first collect a large-scale multi-subject multiple-choice QA dataset for Chinese, ExamQA. We next use incomplete, yet relevant snippets returned by a web search engine as the context for each QA instance to convert it into a weakly-labeled MRC instance. To better use the weakly-labeled data to improve a target MRC task, we evaluate and compare several methods and further propose a self-teaching paradigm. Experimental results show that, upon state-of-the-art MRC baselines, we can obtain +5.1% in accuracy on a multiple-choice Chinese MRC dataset, C3, and +3.8% in exact match on an extractive Chinese MRC dataset, CMRC 2018, demonstrating the usefulness of the generated QA-based weakly-labeled data for different types of MRC tasks as well as the effectiveness of self-teaching. ExamQA will be available at https://dataset.org/examqa/.
- Anthology ID:
- 2021.findings-emnlp.6
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2021
- Month:
- November
- Year:
- 2021
- Address:
- Punta Cana, Dominican Republic
- Editors:
- Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
- Venue:
- Findings
- SIG:
- SIGDAT
- Publisher:
- Association for Computational Linguistics
- Pages:
- 56–68
- URL:
- https://aclanthology.org/2021.findings-emnlp.6
- DOI:
- 10.18653/v1/2021.findings-emnlp.6
- Cite (ACL):
- Dian Yu, Kai Sun, Dong Yu, and Claire Cardie. 2021. Self-Teaching Machines to Read and Comprehend with Large-Scale Multi-Subject Question-Answering Data. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 56–68, Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Cite (Informal):
- Self-Teaching Machines to Read and Comprehend with Large-Scale Multi-Subject Question-Answering Data (Yu et al., Findings 2021)
- PDF:
- https://preview.aclanthology.org/emnlp-22-attachments/2021.findings-emnlp.6.pdf
- Data
- C3, CMRC, CMRC 2018, DRCD, HeadQA, JEC-QA, MedQA