“A Little is Enough”: Few-Shot Quality Estimation based Corpus Filtering improves Machine Translation

Akshay Batheja, Pushpak Bhattacharyya


Abstract
Quality Estimation (QE) is the task of evaluating the quality of a translation when no reference translation is available. The goal of QE aligns with the task of corpus filtering, where we assign a quality score to each sentence pair in a pseudo-parallel corpus. We propose a Quality Estimation based Filtering approach to extract high-quality parallel data from a pseudo-parallel corpus. To the best of our knowledge, this is a novel adaptation of the QE framework to extracting a quality parallel corpus from a pseudo-parallel corpus. By training with this filtered corpus, we observe an improvement in the Machine Translation (MT) system’s performance of up to 1.8 BLEU points for the English-Marathi, Chinese-English, and Hindi-Bengali language pairs over the baseline model, which is trained on the whole pseudo-parallel corpus. Our few-shot QE model, transfer-learned from the English-Marathi QE model and fine-tuned on only 500 Hindi-Bengali training instances, shows an improvement of up to 0.6 BLEU points for the Hindi-Bengali language pair compared to the baseline model. This demonstrates the promise of transfer learning in the setting under discussion. QE systems typically require on the order of 7K-25K training instances. Our Hindi-Bengali QE model is trained on only 500 instances, 1/40th of the normal requirement, and achieves comparable performance. All the scripts and datasets utilized in this study will be publicly available.
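The filtering idea described in the abstract (score each sentence pair, keep the high-scoring ones) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the function names, the threshold value, and the example sentences are all hypothetical, and `toy_length_ratio_score` is a crude stand-in for a trained QE model, which would assign scores in the actual approach.

```python
from typing import Callable, List, Tuple

SentencePair = Tuple[str, str]


def filter_corpus(
    pairs: List[SentencePair],
    score_fn: Callable[[str, str], float],
    threshold: float,
) -> List[SentencePair]:
    """Keep only the sentence pairs whose quality score is at least `threshold`."""
    return [pair for pair in pairs if score_fn(pair[0], pair[1]) >= threshold]


def toy_length_ratio_score(source: str, target: str) -> float:
    """Hypothetical placeholder for a QE model.

    Scores a pair by the ratio of the shorter to the longer sentence,
    measured in whitespace-separated tokens. A real QE model would
    predict translation quality instead.
    """
    src_len = len(source.split())
    tgt_len = len(target.split())
    if src_len == 0 or tgt_len == 0:
        return 0.0
    return min(src_len, tgt_len) / max(src_len, tgt_len)


# Made-up pseudo-parallel data, for illustration only.
pseudo_parallel = [
    ("the cat sat on the mat", "le chat est assis sur le tapis"),
    ("good morning", "bonjour à tous et bienvenue dans notre émission spéciale"),
    ("thank you", "merci beaucoup"),
]

# The threshold of 0.5 is an arbitrary assumption for this sketch.
filtered = filter_corpus(pseudo_parallel, toy_length_ratio_score, threshold=0.5)
```

With the toy scorer, the badly mismatched second pair falls below the threshold and is dropped, while the other two pairs are kept. In practice, the threshold would need to be tuned per language pair.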
Anthology ID:
2023.findings-acl.892
Volume:
Findings of the Association for Computational Linguistics: ACL 2023
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
14175–14185
URL:
https://aclanthology.org/2023.findings-acl.892
DOI:
10.18653/v1/2023.findings-acl.892
Cite (ACL):
Akshay Batheja and Pushpak Bhattacharyya. 2023. “A Little is Enough”: Few-Shot Quality Estimation based Corpus Filtering improves Machine Translation. In Findings of the Association for Computational Linguistics: ACL 2023, pages 14175–14185, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
“A Little is Enough”: Few-Shot Quality Estimation based Corpus Filtering improves Machine Translation (Batheja & Bhattacharyya, Findings 2023)
PDF:
https://preview.aclanthology.org/nschneid-patch-5/2023.findings-acl.892.pdf
Video:
 https://preview.aclanthology.org/nschneid-patch-5/2023.findings-acl.892.mp4