GermanQuAD and GermanDPR: Improving Non-English Question Answering and Passage Retrieval

Timo Möller; Julian Risch; Malte Pietsch

doi:10.18653/v1/2021.mrqa-1.4

Abstract

A major challenge of research on non-English machine reading for question answering (QA) is the lack of annotated datasets. In this paper, we present GermanQuAD, a dataset of 13,722 extractive question/answer pairs. To improve the reproducibility of the dataset creation approach and foster QA research on other languages, we summarize lessons learned and evaluate reformulation of question/answer pairs as a way to speed up the annotation process. An extractive QA model trained on GermanQuAD significantly outperforms multilingual models and also shows that machine-translated training data cannot fully substitute hand-annotated training data in the target language. Finally, we demonstrate the wide range of applications of GermanQuAD by adapting it to GermanDPR, a training dataset for dense passage retrieval (DPR), and train and evaluate one of the first non-English DPR models.

Anthology ID: 2021.mrqa-1.4
Volume: Proceedings of the 3rd Workshop on Machine Reading for Question Answering
Month: November
Year: 2021
Address: Punta Cana, Dominican Republic
Editors: Adam Fisch, Alon Talmor, Danqi Chen, Eunsol Choi, Minjoon Seo, Patrick Lewis, Robin Jia, Sewon Min
Venue: MRQA
Publisher: Association for Computational Linguistics
Pages: 42–50
URL: https://preview.aclanthology.org/ingest-emnlp/2021.mrqa-1.4/
DOI: 10.18653/v1/2021.mrqa-1.4
Cite (ACL): Timo Möller, Julian Risch, and Malte Pietsch. 2021. GermanQuAD and GermanDPR: Improving Non-English Question Answering and Passage Retrieval. In Proceedings of the 3rd Workshop on Machine Reading for Question Answering, pages 42–50, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal): GermanQuAD and GermanDPR: Improving Non-English Question Answering and Passage Retrieval (Möller et al., MRQA 2021)
PDF: https://preview.aclanthology.org/ingest-emnlp/2021.mrqa-1.4.pdf
