2025
pdf
bib
abs
Representation Bending for Large Language Model Safety
Ashkan Yousefpour
|
Taeheon Kim
|
Ryan Sungmo Kwon
|
Seungbeen Lee
|
Wonje Jeung
|
Seungju Han
|
Alvin Wan
|
Harrison Ngan
|
Youngjae Yu
|
Jonghyun Choi
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Models (LLMs) have emerged as powerful tools, but their inherent safety risks – ranging from harmful content generation to broader societal harms – pose significant challenges. These risks can be amplified by the recent adversarial attacks, fine-tuning vulnerabilities, and the increasing deployment of LLMs in high-stakes environments. Existing safety-enhancing techniques, such as fine-tuning with human feedback or adversarial training, are still vulnerable as they address specific threats and often fail to generalize across unseen attacks, or require manual system-level defenses. This paper introduces RepBend, a novel approach that fundamentally disrupts the representations underlying harmful behaviors in LLMs, offering a scalable solution to enhance (potentially inherent) safety. RepBend brings the idea of activation steering – simple vector arithmetic for steering model’s behavior during inference – to loss-based fine-tuning. Through extensive evaluation, RepBend achieves state-of-the-art performance, outperforming prior methods such as Circuit Breaker, RMU, and NPO, with up to 95% reduction in attack success rates across diverse jailbreak benchmarks, all with negligible reduction in model usability and general capabilities.
pdf
bib
abs
Persona Dynamics: Unveiling the Impact of Persona Traits on Agents in Text-Based Games
Seungwon Lim
|
Seungbeen Lee
|
Dongjun Min
|
Youngjae Yu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Artificial agents are increasingly central to complex interactions and decision-making tasks, yet aligning their behaviors with desired human values remains an open challenge. In this work, we investigate how human-like personality traits influence agent behavior and performance within text-based interactive environments. We introduce PANDA: Personality Adapted Neural Decision Agents, a novel method for projecting human personality traits onto agents to guide their behavior. To induce personality in a text-based game agent, (i) we train a personality classifier to identify what personality type the agent’s actions exhibit, and (ii) we integrate the personality profiles directly into the agent’s policy-learning pipeline. By deploying agents embodying 16 distinct personality types across 25 text-based games and analyzing their trajectories, we demonstrate that an agent’s action decisions can be guided toward specific personality profiles. Moreover, certain personality types, such as those characterized by higher levels of Openness, display marked advantages in performance. These findings underscore the promise of personality-adapted agents for fostering more aligned, effective, and human-centric decision-making in interactive environments.
pdf
bib
abs
Do LLMs Have Distinct and Consistent Personality? TRAIT: Personality Testset designed for LLMs with Psychometrics
Seungbeen Lee
|
Seungwon Lim
|
Seungju Han
|
Giyeong Oh
|
Hyungjoo Chae
|
Jiwan Chung
|
Minju Kim
|
Beong-woo Kwak
|
Yeonsoo Lee
|
Dongha Lee
|
Jinyoung Yeo
|
Youngjae Yu
Findings of the Association for Computational Linguistics: NAACL 2025
Recent advancements in Large Language Models (LLMs) have led to their adaptation in various domains as conversational agents. We wonder: can personality tests be applied to these agents to analyze their behavior, similar to humans? We introduce TRAIT, a new benchmark consisting of 8K multi-choice questions designed to assess the personality of LLMs. TRAIT is built on two psychometrically validated small human questionnaires, Big Five Inventory (BFI) and Short Dark Triad (SD-3), enhanced with the ATOMIC-10X knowledge graph to a variety of real-world scenarios. TRAIT also outperforms existing personality tests for LLMs in terms of reliability and validity, achieving the highest scores across four key metrics: Content Validity, Internal Validity, Refusal Rate, and Reliability. Using TRAIT, we reveal two notable insights into personalities of LLMs: 1) LLMs exhibit distinct and consistent personality, which is highly influenced by their training data (e.g., data used for alignment tuning), and 2) current prompting techniques have limited effectiveness in eliciting certain traits, such as high psychopathy or low conscientiousness, suggesting the need for further research in this direction.
2024
pdf
bib
abs
Can visual language models resolve textual ambiguity with visual cues? Let visual puns tell you!
Jiwan Chung
|
Seungwon Lim
|
Jaehyun Jeon
|
Seungbeen Lee
|
Youngjae Yu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Humans possess multimodal literacy, allowing them to actively integrate information from various modalities to form reasoning. Faced with challenges like lexical ambiguity in text, we supplement this with other modalities, such as thumbnail images or textbook illustrations. Is it possible for machines to achieve a similar multimodal understanding capability?In response, we present Understanding Pun with Image Explanations (UNPIE), a novel benchmark designed to assess the impact of multimodal inputs in resolving lexical ambiguities. Puns serve as the ideal subject for this evaluation due to their intrinsic ambiguity. Our dataset includes 1,000 puns, each accompanied by an image that explains both meanings. We pose three multimodal challenges with the annotations to assess different aspects of multimodal literacy; Pun Grounding, Disambiguation, and Reconstruction. The results indicate that various Socratic Models and Visual-Language Models improve over the text-only models when given visual context, particularly as the complexity of the tasks increases.
pdf
bib
abs
Cactus: Towards Psychological Counseling Conversations using Cognitive Behavioral Theory
Suyeon Lee
|
Sunghwan Kim
|
Minju Kim
|
Dongjin Kang
|
Dongil Yang
|
Harim Kim
|
Minseok Kang
|
Dayi Jung
|
Min Hee Kim
|
Seungbeen Lee
|
Kyong-Mee Chung
|
Youngjae Yu
|
Dongha Lee
|
Jinyoung Yeo
Findings of the Association for Computational Linguistics: EMNLP 2024
Recently, the demand for psychological counseling has significantly increased as more individuals express concerns about their mental health. This surge has accelerated efforts to improve the accessibility of counseling by using large language models (LLMs) as counselors. To ensure client privacy, training open-source LLMs faces a key challenge: the absence of realistic counseling datasets. To address this, we introduce Cactus, a multi-turn dialogue dataset that emulates real-life interactions using the goal-oriented and structured approach of Cognitive Behavioral Therapy (CBT).We create a diverse and realistic dataset by designing clients with varied, specific personas, and having counselors systematically apply CBT techniques in their interactions. To assess the quality of our data, we benchmark against established psychological criteria used to evaluate real counseling sessions, ensuring alignment with expert evaluations.Experimental results demonstrate that Camel, a model trained with Cactus, outperforms other models in counseling skills, highlighting its effectiveness and potential as a counseling agent.We make our data, model, and code publicly available.