Hantae Kim


2022

pdf
DaLC: Domain Adaptation Learning Curve Prediction for Neural Machine Translation
Cheonbok Park | Hantae Kim | Ioan Calapodescu | Hyun Chang Cho | Vassilina Nikoulina
Findings of the Association for Computational Linguistics: ACL 2022

Domain Adaptation (DA) of Neural Machine Translation (NMT) model often relies on a pre-trained general NMT model which is adapted to the new domain on a sample of in-domain parallel data. Without parallel data, there is no way to estimate the potential benefit of DA, nor the amount of parallel samples it would require. It is however a desirable functionality that could help MT practitioners to make an informed decision before investing resources in dataset creation. We propose a Domain adaptation Learning Curve prediction (DaLC) model that predicts prospective DA performance based on in-domain monolingual samples in the source language. Our model relies on the NMT encoder representations combined with various instance and corpus-level features. We demonstrate that instance-level is better able to distinguish between different domains compared to corpus-level frameworks proposed in previous studies Finally, we perform in-depth analyses of the results highlighting the limitations of our approach, and provide directions for future research.

pdf
Specializing Multi-domain NMT via Penalizing Low Mutual Information
Jiyoung Lee | Hantae Kim | Hyunchang Cho | Edward Choi | Cheonbok Park
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Multi-domain Neural Machine Translation (NMT) trains a single model with multiple domains. It is appealing because of its efficacy in handling multiple domains within one model. An ideal multi-domain NMT learns distinctive domain characteristics simultaneously, however, grasping the domain peculiarity is a non-trivial task. In this paper, we investigate domain-specific information through the lens of mutual information (MI) and propose a new objective that penalizes low MI to become higher.Our method achieved the state-of-the-art performance among the current competitive multi-domain NMT models. Also, we show our objective promotes low MI to be higher resulting in domain-specialized multi-domain NMT.

2021

pdf
Papago’s Submission for the WMT21 Quality Estimation Shared Task
Seunghyun Lim | Hantae Kim | Hyunjoong Kim
Proceedings of the Sixth Conference on Machine Translation

This paper describes Papago submission to the WMT 2021 Quality Estimation Task 1: Sentence-level Direct Assessment. Our multilingual Quality Estimation system explores the combination of Pretrained Language Models and Multi-task Learning architectures. We propose an iterative training pipeline based on pretraining with large amounts of in-domain synthetic data and finetuning with gold (labeled) data. We then compress our system via knowledge distillation in order to reduce parameters yet maintain strong performance. Our submitted multilingual systems perform competitively in multilingual and all 11 individual language pair settings including zero-shot.