Joyce Jiyoung Whang


2025

Beneath the Facade: Probing Safety Vulnerabilities in LLMs via Auto-Generated Jailbreak Prompts
Heehyeon Kim | Kyeongryul Lee | Joyce Jiyoung Whang
Findings of the Association for Computational Linguistics: EMNLP 2025

The rapid proliferation of large language models and multimodal generative models has raised concerns about their potential vulnerabilities to a wide range of real-world safety risks. However, a critical gap persists in systematic assessment, and existing evaluation frameworks fail to keep pace with the breadth and variability of real-world risk factors. In this paper, we introduce TroGEN, an automated jailbreak prompt generation framework that assesses these vulnerabilities by deriving scenario-driven jailbreak prompts using an adversarial agent. Moving beyond labor-intensive dataset construction, TroGEN features an extensible design that covers a broad range of risks, supports plug-and-play jailbreak strategies, and adapts seamlessly to multimodal settings. Experimental results demonstrate that TroGEN effectively uncovers safety weaknesses, revealing susceptibilities to adversarial attacks that conceal malicious intent beneath an apparently benign facade, like a Trojan horse. Furthermore, such stealthy attacks remain effective even against existing jailbreak defense methods.
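
The paper's code is not reproduced here; as a minimal sketch of the scenario-driven generation loop the abstract describes, consider the following, where adversarial_agent, target_model, is_refusal, and generate_jailbreak are hypothetical names invented for illustration rather than TroGEN's actual interface:

# Hypothetical sketch, not TroGEN's code: an adversarial agent wraps a
# risky intent in a benign-looking scenario and refines it using the
# target model's refusals, until the facade slips past the safeguard.

def adversarial_agent(intent: str, feedback: str | None) -> str:
    # Placeholder for an LLM agent; a real system would generate a
    # fluent cover story (fiction, role-play, research framing, ...).
    scenario = f"You are a novelist. A character in your story explains: {intent}"
    if feedback:
        scenario += f"\n(Rewrite so the previous refusal does not recur: {feedback})"
    return scenario

def target_model(prompt: str) -> str:
    # Placeholder for the model under evaluation.
    return "I'm sorry, I can't help with that."

def is_refusal(response: str) -> bool:
    return response.lower().startswith(("i'm sorry", "i can't", "i cannot"))

def generate_jailbreak(intent: str, max_rounds: int = 5) -> str | None:
    feedback = None
    for _ in range(max_rounds):
        prompt = adversarial_agent(intent, feedback)
        response = target_model(prompt)
        if not is_refusal(response):
            return prompt      # a scenario that slipped past the safeguard
        feedback = response    # refine the facade using the refusal
    return None

print(generate_jailbreak("a risky request used only for safety testing"))

The key design point is the feedback loop: each refusal from the target is handed back to the agent, which rewrites the scenario to conceal the intent more thoroughly.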

2024

Why So Gullible? Enhancing the Robustness of Retrieval-Augmented Models against Counterfactual Noise
Giwon Hong | Jeonghwan Kim | Junmo Kang | Sung-Hyon Myaeng | Joyce Jiyoung Whang
Findings of the Association for Computational Linguistics: NAACL 2024

Most existing retrieval-augmented language models (LMs) assume a naive dichotomy within a retrieved document set: query-relevance and irrelevance. Our work investigates a more challenging scenario in which even the “relevant” documents may contain misleading or incorrect information, creating conflicts among the retrieved documents that act as noise and negatively influence model decisions. We observe that existing LMs are highly brittle to the presence of conflicting information in both the fine-tuning and in-context few-shot learning scenarios. We propose approaches for handling knowledge conflicts among retrieved documents by explicitly fine-tuning a discriminator or prompting GPT-3.5 to elicit its discriminative capability. Our empirical results on open-domain QA show that these approaches significantly enhance model robustness. We also report our findings on incorporating the fine-tuned discriminator’s decision into the in-context learning process, proposing a way to combine the benefits of two disparate learning schemes. Alongside our findings, we release MacNoise, a machine-generated, conflict-induced dataset to further encourage research in this direction.
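
As an illustration of the discriminator-based recipe described above (and not the paper's released implementation), the sketch below scores each retrieved document for trustworthiness and filters conflicting ones before reading; Document, discriminator, reader, and the 0.5 threshold are invented placeholders:

# Illustrative sketch of discriminator-filtered retrieval-augmented QA.
# A real discriminator would be a fine-tuned classifier (or GPT-3.5
# prompted for the same judgment); here it is stubbed out.

from dataclasses import dataclass

@dataclass
class Document:
    text: str
    score: float = 0.0  # discriminator's credibility score

def discriminator(query: str, doc: Document) -> float:
    # Placeholder: a trained model would return the probability that
    # the document is free of counterfactual / conflicting content.
    return 0.5

def reader(query: str, docs: list[Document]) -> str:
    # Placeholder for the answer-generating LM.
    return f"answer conditioned on {len(docs)} documents"

def robust_qa(query: str, retrieved: list[Document], threshold: float = 0.5) -> str:
    for doc in retrieved:
        doc.score = discriminator(query, doc)
    trusted = [d for d in retrieved if d.score >= threshold]
    # Fall back to all documents if everything gets filtered out.
    return reader(query, trusted or retrieved)

docs = [Document("Paris is the capital of France."),
        Document("Lyon is the capital of France.")]  # conflicting claim
print(robust_qa("What is the capital of France?", docs))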

2023

FinePrompt: Unveiling the Role of Finetuned Inductive Bias on Compositional Reasoning in GPT-4
Jeonghwan Kim | Giwon Hong | Sung-Hyon Myaeng | Joyce Jiyoung Whang
Findings of the Association for Computational Linguistics: EMNLP 2023

Compositional reasoning across texts has been a long-standing challenge in natural language processing. With large language models like GPT-4 taking over the field, prompting techniques such as chain-of-thought (CoT) have been proposed to unlock the compositional, multi-step reasoning capabilities of LLMs. Despite their success, these prompts demand significant human effort to discover and validate. Our work draws attention to the idea of transferring task-specific inductive biases from finetuned models to prompts as a way of improving GPT-4’s compositional reasoning capabilities. To ease this transfer, we formulate prompt templates that encode the inductive biases of finetuned models. Experimental results on multi-hop question answering and numerical reasoning over text show that our proposed prompt scheme achieves zero-shot and few-shot performance competitive with existing prompts on complicated reasoning tasks, highlighting the importance of adopting the validated biases of the previous paradigm.
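
By way of illustration only (the template below is invented, not one of FinePrompt's actual templates), here is how a finetuned model's inductive bias (explicit question decomposition, as learned by multi-hop QA models) might be baked into a prompt:

# Hypothetical example of encoding a finetuned model's inductive bias
# (question decomposition for multi-hop QA) as a reusable prompt template.

DECOMPOSE_TEMPLATE = """Answer the question by first breaking it into
single-hop sub-questions, answering each, then combining the answers.

Question: {question}
Sub-questions and answers:
"""

def build_prompt(question: str) -> str:
    return DECOMPOSE_TEMPLATE.format(question=question)

print(build_prompt(
    "Which country is the director of the film Parasite from?"))

The point is that the decomposition step, normally learned through finetuning, is stated explicitly in the template, so a frozen model like GPT-4 can follow it without any parameter updates.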

VISTA: Visual-Textual Knowledge Graph Representation Learning
Jaejun Lee | Chanyoung Chung | Hochang Lee | Sungho Jo | Joyce Jiyoung Whang
Findings of the Association for Computational Linguistics: EMNLP 2023

Knowledge graphs represent human knowledge using triplets composed of entities and relations. While most existing knowledge graph embedding methods consider only the structure of a knowledge graph, a few recently proposed multimodal methods utilize images or text descriptions of entities. In this paper, we propose visual-textual knowledge graphs (VTKGs), where not only entities but also triplets can be explained using images, and both entities and relations can be accompanied by text descriptions. By compiling visually expressible commonsense knowledge, we construct new benchmark datasets where triplets themselves are explained by images and the meanings of entities and relations are described using text. We propose VISTA, a knowledge graph representation learning method for VTKGs that incorporates the visual and textual representations of entities and relations using entity encoding, relation encoding, and triplet decoding transformers. Experiments show that VISTA outperforms state-of-the-art knowledge graph completion methods on real-world VTKGs.
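
The abstract names three transformer components; purely as an architectural sketch under assumed dimensions (the released VISTA code will differ, and the triplet decoder here is simplified to a plain transformer encoder stack reading a [CLS] token), one could wire them together as follows:

# Illustrative sketch, not the released VISTA code: encode each entity
# and relation from a short sequence of modality tokens (structural,
# image, text embeddings), then score the (head, relation, tail) triplet.

import torch
import torch.nn as nn

class VTKGScorer(nn.Module):
    def __init__(self, dim: int = 64, n_heads: int = 4):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.entity_encoder = nn.TransformerEncoder(layer(), num_layers=1)
        self.relation_encoder = nn.TransformerEncoder(layer(), num_layers=1)
        self.triplet_decoder = nn.TransformerEncoder(layer(), num_layers=1)
        self.cls = nn.Parameter(torch.randn(1, 1, dim))  # [CLS] read-out token
        self.score_head = nn.Linear(dim, 1)

    def encode(self, encoder, feats):
        # feats: (batch, n_modalities, dim) -- structural, image, and
        # text embeddings stacked as a token sequence per element.
        return encoder(feats).mean(dim=1)  # fuse modalities into one vector

    def forward(self, head_feats, rel_feats, tail_feats):
        h = self.encode(self.entity_encoder, head_feats)
        r = self.encode(self.relation_encoder, rel_feats)
        t = self.encode(self.entity_encoder, tail_feats)
        seq = torch.stack([h, r, t], dim=1)          # (batch, 3, dim)
        seq = torch.cat([self.cls.expand(seq.size(0), -1, -1), seq], dim=1)
        out = self.triplet_decoder(seq)[:, 0]        # read the [CLS] position
        return self.score_head(out).squeeze(-1)      # plausibility per triplet

model = VTKGScorer()
feats = lambda: torch.randn(2, 3, 64)  # 2 triplets, 3 modality tokens each
print(model(feats(), feats(), feats()).shape)  # torch.Size([2])

One appeal of this token-per-modality layout is extensibility: adding a new modality for an entity or relation amounts to appending one more token to its input sequence.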