Yunju Bak


2025

pdf bib
PROM: Pivoted and Regulated Optimization for Multilingual Instruction Learning
Jaeseong Lee | Seung-won Hwang | Hojin Lee | Yunju Bak | Changmin Lee
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)

Large language models (LLMs) have become standard for natural language generation tasks, with instruction-tuning enhancing their capabilities. However, the lack of instruction-tuning datasets in languages other than English limits their application to diverse languages. To address this, researchers have adapted English-centric LLMs to other languages by appending English tuning data with its translated pair, from which we observe negative interference between the two. To resolve this, our contribution is identifying English as an internal pivot language, based on which we disentangle the roles of English and target language data in training. Specifically, we first design two roles as pivoted objectives, and also propose to regulate between the two, to better generalize for under-represented languages. Experiments across various languages demonstrate the effectiveness of our approach on multiple benchmarks. The code is publicly available for further exploration.

2021

pdf bib
Kakao Enterprise’s WMT21 Machine Translation Using Terminologies Task Submission
Yunju Bak | Jimin Sun | Jay Kim | Sungwon Lyu | Changmin Lee
Proceedings of the Sixth Conference on Machine Translation

This paper describes Kakao Enterprise’s submission to the WMT21 shared Machine Translation using Terminologies task. We integrate terminology constraints by pre-training with target lemma annotations and fine-tuning with exact target annotations utilizing the given terminology dataset. This approach yields a model that achieves outstanding results in terms of both translation quality and term consistency, ranking first based on COMET in the En→Fr language direction. Furthermore, we explore various methods such as back-translation, explicitly training terminologies as additional parallel data, and in-domain data selection.