Byung-Jun Lee


2025

K/DA: Automated Data Generation Pipeline for Detoxifying Implicitly Offensive Language in Korean
Minkyeong Jeon | Hyemin Jeong | Yerang Kim | Jiyoung Kim | Jae Hyeon Cho | Byung-Jun Lee
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Language detoxification involves removing toxicity from offensive language. While a neutral-toxic paired dataset provides a straightforward approach for training detoxification models, creating such datasets presents several challenges: i) the need for human annotation to build paired data, and ii) the rapid evolution of offensive terms, which renders static datasets quickly outdated. To tackle these challenges, we introduce an automated paired data generation pipeline, called K/DA. This pipeline is designed to generate offensive language with implicit offensiveness and trend-aligned slang, making the resulting dataset suitable for training detoxification models. We demonstrate that the dataset generated by K/DA exhibits higher pair consistency and greater implicit offensiveness than existing Korean datasets, and that the pipeline is also applicable to other languages. Furthermore, the dataset enables effective training of a high-performing detoxification model with simple instruction fine-tuning.
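
A minimal sketch of how a neutral-toxic paired dataset of this kind might be turned into instruction fine-tuning records. This is not the K/DA pipeline or its dataset schema; the prompt template, field names, and placeholder pairs below are illustrative assumptions only.

```python
# Hypothetical sketch: formatting (toxic, neutral) pairs for instruction fine-tuning.
# The instruction text, record fields, and example data are assumptions for
# illustration, not the paper's actual prompt or data format.

INSTRUCTION = (
    "Rewrite the following sentence so it is no longer offensive, "
    "while preserving its original meaning."
)

def to_instruction_example(toxic: str, neutral: str) -> dict:
    """Turn one toxic/neutral pair into a prompt-response training record."""
    return {
        "prompt": f"{INSTRUCTION}\n\nSentence: {toxic}\nRewrite:",
        "response": neutral,
    }

# Placeholder pairs; in practice these would come from an automated pipeline
# such as the one described in the paper.
pairs = [
    ("placeholder offensive sentence", "placeholder neutral rewrite"),
]
dataset = [to_instruction_example(toxic, neutral) for toxic, neutral in pairs]
```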

2023

Quantifying Information of Tokens for Simple and Flexible Simultaneous Machine Translation
DongHyun Lee | Minkyung Park | Byung-Jun Lee
Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL)

Simultaneous Translation (ST) involves translating from partial source inputs rather than the entire source input, a process that can degrade translation quality. Previous approaches to balancing translation quality and latency have shown that it is more efficient and effective to leverage an offline model with a reasonable policy. However, using an offline model also introduces a distribution shift, since the model is not trained with partial source inputs; this can be mitigated by training an additional module that indicates when to translate. In this paper, we propose an Information Quantifier (IQ), trained on oracle action sequences generated from the offline model, that models source and target information to determine whether the offline model has sufficient information for translation. By quantifying information, IQ helps formulate a suitable policy for Simultaneous Translation that generalizes better and allows us to naturally control the trade-off between quality and latency. Experiments on various language pairs show that our proposed model outperforms baselines.
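
A minimal sketch of the kind of READ/WRITE policy loop such an information-sufficiency score could drive. The scorer and translator below are assumed stand-ins, not the paper's IQ module or offline model; the threshold-based rule is only one simple way to turn a score into a policy.

```python
# Hypothetical sketch: a simultaneous-translation READ/WRITE loop driven by an
# information-sufficiency score. `info_quantifier` and `translate_prefix` are
# placeholder stand-ins, not the paper's learned IQ module or offline MT system.

def info_quantifier(source_prefix: list[str], target_prefix: list[str]) -> float:
    """Stand-in scorer: how much untranslated information the source prefix holds.
    Here it is just the length gap; in the paper this role is learned."""
    return float(len(source_prefix) - len(target_prefix))

def translate_prefix(source_prefix: list[str], target_prefix: list[str]) -> str:
    """Stand-in for querying an offline translation model for the next target token."""
    return f"tok{len(target_prefix)}"

def simultaneous_translate(source_stream, threshold: float = 1.0) -> list[str]:
    """READ source tokens until the score exceeds the threshold, then WRITE."""
    source_prefix: list[str] = []
    target_prefix: list[str] = []
    for token in source_stream:
        source_prefix.append(token)  # READ action: consume one more source token
        while info_quantifier(source_prefix, target_prefix) > threshold:
            # WRITE action: emit a target token while enough information is available
            target_prefix.append(translate_prefix(source_prefix, target_prefix))
    # After the source ends, flush the remaining translation.
    while info_quantifier(source_prefix, target_prefix) > 0:
        target_prefix.append(translate_prefix(source_prefix, target_prefix))
    return target_prefix

# Example usage with a toy token stream.
print(simultaneous_translate(["w1", "w2", "w3", "w4"]))
```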

Improving Neural Machine Translation with Offline Evaluations
Min-Kyung Park | Byung-Jun Lee
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

2014

Optimizing Generative Dialog State Tracker via Cascading Gradient Descent
Byung-Jun Lee | Woosang Lim | Daejoong Kim | Kee-Eung Kim
Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL)