Unsupervised Domain Adaptation for Question Generation with Domain Data Selection and Self-training

Peide Zhu, Claudia Hauff


Abstract
Question generation (QG) approaches based on large neural models require (i) large-scale and (ii) high-quality training data. These two requirements pose difficulties for specific application domains where training data is expensive and difficult to obtain. A trained QG model's effectiveness can degrade significantly when it is applied to a different domain due to domain shift. In this paper, we explore an unsupervised domain adaptation approach that addresses the lack of training data and the domain shift issue with domain data selection and self-training. We first present a novel answer-aware strategy for domain data selection that selects the data most similar to a new domain. The selected data are then used as pseudo-in-domain data to retrain the QG model. We then present generation-confidence-guided self-training with two methods for modeling generation confidence: (i) the perplexity of the generated questions and (ii) their fluency score. We evaluate our approaches on three large public datasets with different degrees of domain similarity, using a transformer-based pre-trained QG model. The results show that our proposed approaches outperform the baselines and demonstrate the viability of unsupervised domain adaptation with answer-aware data selection and self-training for the QG task.
Anthology ID:
2022.findings-naacl.183
Volume:
Findings of the Association for Computational Linguistics: NAACL 2022
Month:
July
Year:
2022
Address:
Seattle, United States
Editors:
Marine Carpuat, Marie-Catherine de Marneffe, Ivan Vladimir Meza Ruiz
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
2388–2401
URL:
https://preview.aclanthology.org/build-pipeline-with-new-library/2022.findings-naacl.183/
DOI:
10.18653/v1/2022.findings-naacl.183
Cite (ACL):
Peide Zhu and Claudia Hauff. 2022. Unsupervised Domain Adaptation for Question Generation with Domain Data Selection and Self-training. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 2388–2401, Seattle, United States. Association for Computational Linguistics.
Cite (Informal):
Unsupervised Domain Adaptation for Question Generation with Domain Data Selection and Self-training (Zhu & Hauff, Findings 2022)
PDF:
https://preview.aclanthology.org/build-pipeline-with-new-library/2022.findings-naacl.183.pdf
Data
Natural Questions, RACE, SciQ