CHIA: CHoosing Instances to Annotate for Machine Translation

Rajat Bhatnagar, Ananya Ganesh, Katharina Kann


Abstract
Neural machine translation (MT) systems have been shown to perform poorly on low-resource language pairs, for which large-scale parallel data is unavailable. Making the data annotation process faster and cheaper is therefore important to ensure equitable access to MT systems. To make optimal use of a limited annotation budget, we present CHIA (choosing instances to annotate), a method for selecting instances to annotate for machine translation. Using an existing multi-way parallel dataset of high-resource languages, we first identify instances, based on model training dynamics, that are most informative for training MT models for high-resource languages. We find that there are cross-lingual commonalities in instances that are useful for MT model training, which we use to identify instances that will be useful for training models on a new target language. Evaluating on 20 languages from two corpora, we show that training on instances selected using our method provides an average performance improvement of 1.59 BLEU over training on an equally sized set of randomly selected instances.
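The abstract does not spell out the scoring function, so the following is only a minimal sketch of the overall recipe it describes: score each multi-way parallel source sentence by its training dynamics on several high-resource pairs, average the scores across pairs to exploit the cross-lingual commonalities, and annotate the top-scoring sentences for the new target language. The loss-drop criterion and the function names (score_examples, select_for_annotation) are assumptions for illustration, not the paper's exact method.

    from statistics import mean

    def score_examples(loss_history):
        """Score each source sentence by its training dynamics.

        `loss_history` maps an example ID to its per-epoch training losses
        for one high-resource language pair. As an assumed informativeness
        signal, we use the total drop in loss over training; the paper's
        actual criterion may differ.
        """
        return {ex_id: losses[0] - losses[-1]
                for ex_id, losses in loss_history.items()}

    def select_for_annotation(histories_per_language, budget):
        """Pick the `budget` examples with the highest cross-lingual score.

        `histories_per_language` maps each high-resource language pair to
        its loss history, keyed by the shared (multi-way parallel) example
        ID, so scores for the same source sentence can be averaged.
        """
        per_lang = [score_examples(h) for h in histories_per_language.values()]
        ids = per_lang[0].keys()
        avg_score = {ex_id: mean(s[ex_id] for s in per_lang) for ex_id in ids}
        ranked = sorted(avg_score, key=avg_score.get, reverse=True)
        return ranked[:budget]

    # Toy usage: two high-resource pairs, three shared source sentences,
    # two recorded epochs each.
    histories = {
        "en-de": {"s1": [4.0, 1.0], "s2": [4.0, 3.5], "s3": [4.0, 2.0]},
        "en-fr": {"s1": [3.8, 1.2], "s2": [3.9, 3.7], "s3": [3.8, 1.9]},
    }
    print(select_for_annotation(histories, budget=2))  # -> ['s1', 's3']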
Anthology ID: 2022.findings-emnlp.540
Volume: Findings of the Association for Computational Linguistics: EMNLP 2022
Month: December
Year: 2022
Address: Abu Dhabi, United Arab Emirates
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 7299–7315
URL: https://aclanthology.org/2022.findings-emnlp.540
Cite (ACL): Rajat Bhatnagar, Ananya Ganesh, and Katharina Kann. 2022. CHIA: CHoosing Instances to Annotate for Machine Translation. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 7299–7315, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal): CHIA: CHoosing Instances to Annotate for Machine Translation (Bhatnagar et al., Findings 2022)
PDF: https://preview.aclanthology.org/ingestion-script-update/2022.findings-emnlp.540.pdf