Yeonjea Kim
2026
Incomplete Prompt Jailbreaks in Large Language Models
Yeonjea Kim | Bumjin Park | Jaesik Choi
Findings of the Association for Computational Linguistics: ACL 2026
Large language models (LLMs) are increasingly released as open-weight models with safeguards against harmful requests. Nevertheless, sentence completion remains vulnerable to incomplete harmful prompts. In this work, we formalize this phenomenon as incomplete prompt jailbreaks (IPJ) and provide a systematic empirical characterization of when and how incomplete prompts elicit harmful continuations. We analyze diverse attractor types associated with incomplete sentence continuation and show that LLMs systematically delay refusal until sentence termination. We further demonstrate that training models to refuse incomplete harmful prompts via parameter tuning is insufficient, failing to generalize across both content domains and attractor types. To enable fine-grained control, we identify two functional neurons: termination and continuation neurons. By clarifying their roles in sentence completion, we highlight the potential of neuron-level interventions for more precise and robust IPJ defenses.
2025
When Format Changes Meaning: Investigating Semantic Inconsistency of Large Language Models
Cheongwoong Kang | Jongeun Baek | Yeonjea Kim | Jaesik Choi
Findings of the Association for Computational Linguistics: EMNLP 2025
Large language models (LLMs) have demonstrated remarkable capabilities in natural language processing tasks. However, they remain vulnerable to semantic inconsistency, where minor formatting variations result in divergent predictions for semantically equivalent inputs. Our comprehensive evaluation reveals that this brittleness persists even in state-of-the-art models such as GPT-4o, posing a serious challenge to their reliability. Through a mechanistic analysis, we find that semantically equivalent input changes induce instability in internal representations, ultimately leading to divergent predictions. This reflects a deeper structural issue, where form and meaning are intertwined in the embedding space. We further demonstrate that existing mitigation strategies, including direct fine-tuning on format variations, do not fully address semantic inconsistency, underscoring the difficulty of the problem. Our findings highlight the need for deeper mechanistic understanding to develop targeted methods that improve robustness.