Byung-Jun Lee

2026

SGT: Securing Open-Source LLMs Against Malicious Fine-tuning via Safety Guidance Trigger
Sunguk Shin | Fangzhao Wu | Byung-Jun Lee | Meeyoung Cha | Sungwon Park
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Open-weight large language models (LLMs) enable broad customization, but also increase exposure to post-release misuse, including malicious fine-tuning (MFT). To mitigate this risk, many prior defenses aim to improve the robustness of open-weight models to MFT by constraining adversarial fine-tuning dynamics in parameter space or mitigating harmful information encoded in internal representations. Nevertheless, since malicious fine-tuning can still erode safety, developing robust safeguards for open-weight models that fundamentally mitigate this risk remains an open research problem. In this paper, we characterize a safety region for open-weight LLMs and propose Safety Guidance Trigger (SGT), which guides fine-tuning toward the safety manifold to preserve alignment. SGT has two stages: (1) optimizing a safety trigger that steers the base model toward safe responses and (2) training the open-weight model to align its internal features with trigger-induced safety representations. We demonstrate that SGT substantially improves robustness against malicious fine-tuning, requiring adversaries to increase their data budget significantly to compromise safety. Our analysis shows that SGT anchors model representations to a safety region, which remains stable under malicious fine-tuning.

While the reasoning capabilities of large language models (LLMs) have advanced considerably, efficiently internalizing and leveraging new information in dynamically interactive environments remains a significant challenge. This limitation is particularly pronounced in partially observable environments, which require agents to manage long-term memory and perform effective exploration under incomplete information. To address this, we propose an LLM agent architecture that integrates a knowledge graph as a graph-based memory module. The agent incrementally constructs the knowledge graph through environmental interactions and retrieves relevant information to generate efficient plans. We evaluate our approach in complex navigation tasks specifically designed to present long-horizon and partially observable challenges. Experimental results demonstrate that incorporating a self-extending memory module significantly enhances the performance and efficiency of the LLM’s planning capabilities.

2025

K/DA: Automated Data Generation Pipeline for Detoxifying Implicitly Offensive Language in Korean
Minkyeong Jeon | Hyemin Jeong | Yerang Kim | Jiyoung Kim | Jae Hyeon Cho | Byung-Jun Lee
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Language detoxification involves removing toxicity from offensive language. While a neutral-toxic paired dataset provides a straightforward approach for training detoxification models, creating such datasets presents several challenges: i) the need for human annotation to build paired data, and ii) the rapid evolution of offensive terms, rendering static datasets quickly outdated. To tackle these challenges, we introduce an automated paired data generation pipeline, called K/DA. This pipeline is designed to generate offensive language with implicit offensiveness and trend-aligned slang, making the resulting dataset suitable for detoxification model training. We demonstrate that the dataset generated by K/DA exhibits high pair consistency and greater implicit offensiveness compared to existing Korean datasets, and also demonstrates applicability to other languages. Furthermore, it enables effective training of a high-performing detoxification model with simple instruction fine-tuning.

Rethinking DPO: The Role of Rejected Responses in Preference Misalignment
Jae Hyeon Cho | JunHyeok Oh | Myunsoo Kim | Byung-Jun Lee
Findings of the Association for Computational Linguistics: EMNLP 2025

Direct Preference Optimization (DPO) is a simple and efficient framework that has attracted substantial attention. However, it often struggles to meet its primary objectives—increasing the generation probability of chosen responses while reducing that of rejected responses—due to the dominant influence of rejected responses on the loss function. This imbalance leads to suboptimal performance in promoting preferred responses. In this work, we systematically analyze the limitations of DPO and existing algorithms designed to achieve the objectives stated above. To address these limitations, we propose Bounded-DPO (BDPO), a novel method that bounds the influence of rejected responses while maintaining the original optimization structure of DPO. Through theoretical analysis and empirical evaluations, we demonstrate that BDPO achieves a balanced optimization of the chosen and rejected responses, outperforming existing algorithms.

Iterative Prompt Refinement for Safer Text-to-Image Generation
Jinwoo Jeon | JunHyeok Oh | Hayeong Lee | Byung-Jun Lee
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Text-to-Image (T2I) models have made remarkable progress in generating images from text prompts, but their output quality and safety still depend heavily on how prompts are phrased. Existing safety methods typically refine prompts using large language models (LLMs), but they overlook the images produced, which can result in unsafe outputs or unnecessary changes to already safe prompts. To address this, we propose an iterative prompt refinement algorithm that uses Vision Language Models (VLMs) to analyze both the input prompts and the generated images. By leveraging visual feedback, our method refines prompts more effectively, improving safety while maintaining user intent and reliability comparable to existing LLM-based approaches. Additionally, we introduce a new dataset labeled with both textual and visual safety signals using an off-the-shelf multimodal LLM, enabling supervised fine-tuning. Experimental results demonstrate that our approach produces safer outputs without compromising alignment with user intent, offering a practical solution for generating safer T2I content. WARNING: This paper contains examples of harmful or inappropriate images generated by models.

2023

Improving Neural Machine Translation with Offline Evaluations
Min-Kyung Park | Byung-Jun Lee
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Quantifying Information of Tokens for Simple and Flexible Simultaneous Machine Translation
DongHyun Lee | Minkyung Park | Byung-Jun Lee
Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL)

Simultaneous Translation (ST) involves translating from only partial source input rather than the entire source input, which can degrade translation quality. Previous approaches to balancing translation quality and latency have demonstrated that it is more efficient and effective to leverage an offline model with a reasonable policy. However, using an offline model also leads to a distribution shift, since it is not trained with partial source inputs; this can be mitigated by training an additional module that indicates when to translate. In this paper, we propose an Information Quantifier (IQ) that models source and target information to determine whether the offline model has sufficient information for translation, trained with oracle action sequences generated from the offline model. By quantifying information, IQ helps formulate a policy for Simultaneous Translation that generalizes better and allows natural control of the trade-off between quality and latency. Experiments on various language pairs show that our proposed model outperforms baselines.

2016

Dialog History Construction with Long-Short Term Memory for Robust Generative Dialog State Tracking
Byung-Jun Lee | Kee-Eung Kim
Dialogue & Discourse Volume 7

One of the crucial components of a dialog system is the dialog state tracker, which infers the user’s intention from preliminary speech processing. Since the overall performance of the dialog system is heavily affected by that of the dialog state tracker, it has been one of the core areas of research on dialog systems. In this paper, we present a dialog state tracker that combines a generative probabilistic model of dialog state tracking with a recurrent neural network for encoding important aspects of the dialog history. We describe a two-step gradient descent algorithm that optimizes the tracker with a complex loss function. We demonstrate that this approach yields a dialog state tracker that performs competitively with the top-performing trackers that participated in the first and second Dialog State Tracking Challenges.
