2025
pdf
bib
abs
Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training
Youliang Yuan
|
Wenxiang Jiao
|
Wenxuan Wang
|
Jen-tse Huang
|
Jiahao Xu
|
Tian Liang
|
Pinjia He
|
Zhaopeng Tu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs) by identifying and tackling a refusal position bias within safety tuning data, which compromises the models’ ability to appropriately refuse generating unsafe content. We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position, significantly enhancing their safety capabilities. DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation (MLE) with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful response sequence. Our empirical evaluation, conducted using LLaMA3 and Mistral model families across six attack scenarios, demonstrates that our method not only improves model safety without compromising performance but also surpasses baseline methods in defending against attacks.
pdf
bib
abs
Can’t See the Forest for the Trees: Benchmarking Multimodal Safety Awareness for Multimodal LLMs
Wenxuan Wang
|
Xiaoyuan Liu
|
Kuiyi Gao
|
Jen-tse Huang
|
Youliang Yuan
|
Pinjia He
|
Shuai Wang
|
Zhaopeng Tu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multimodal Large Language Models (MLLMs) have expanded the capabilities of traditional language models by enabling interaction through both text and images. However, ensuring the safety of these models remains a significant challenge, particularly in accurately identifying whether multimodal content is safe or unsafe—a capability we term safety awareness. In this paper, we introduce MMSafeAware, the first comprehensive multimodal safety awareness benchmark designed to evaluate MLLMs across 29 safety scenarios with 1,500 carefully curated image-prompt pairs. MMSafeAware includes both unsafe and over-safety subsets to assess models’ abilities to correctly identify unsafe content and avoid over-sensitivity that can hinder helpfulness. Evaluating nine widely used MLLMs using MMSafeAware reveals that current models are not sufficiently safe and often overly sensitive; for example, GPT-4V misclassifies 36.1% of unsafe inputs as safe and 59.9% of benign inputs as unsafe. We further explore three methods to improve safety awareness—prompting-based approaches, visual contrastive decoding, and vision-centric reasoning fine-tuning—but find that none achieve satisfactory performance. Our findings highlight the profound challenges in developing MLLMs with robust safety awareness, underscoring the need for further research in this area. All the code and data will be publicly available to facilitate future research.
pdf
bib
abs
Insight Over Sight: Exploring the Vision-Knowledge Conflicts in Multimodal LLMs
Xiaoyuan Liu
|
Wenxuan Wang
|
Youliang Yuan
|
Jen-tse Huang
|
Qiuzhi Liu
|
Pinjia He
|
Zhaopeng Tu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
This paper explores the problem of commonsense level vision-knowledge conflict in Multimodal Large Language Models (MLLMs), where visual information contradicts model’s internal commonsense knowledge. To study this issue, we introduce an automated framework, augmented with human-in-the-loop quality control, to generate inputs designed to simulate and evaluate these conflicts in MLLMs. Using this framework, we have crafted a diagnostic benchmark consisting of 374 original images and 1,122 high-quality question-answer (QA) pairs. The benchmark covers two aspects of conflict and three question types, providing a thorough assessment tool. We apply this benchmark to assess the conflict-resolution capabilities of nine representative MLLMs from various model families. Our results indicate an evident over-reliance on parametric knowledge for approximately 20% of all queries, especially among Yes-No and action-related problems. Based on these findings, we evaluate the effectiveness of existing approaches to mitigating the conflicts and compare them to our “Focus-on-Vision” prompting strategy. Despite some improvement, the vision-knowledge conflict remains unresolved and can be further scaled through our data construction framework. Our proposed framework, benchmark, and analysis contribute to the understanding and mitigation of vision-knowledge conflicts in MLLMs.
pdf
bib
abs
Chain-of-Jailbreak Attack for Image Generation Models via Step by Step Editing
Wenxuan Wang
|
Kuiyi Gao
|
Youliang Yuan
|
Jen-tse Huang
|
Qiuzhi Liu
|
Shuai Wang
|
Wenxiang Jiao
|
Zhaopeng Tu
Findings of the Association for Computational Linguistics: ACL 2025
Text-based image generation models, such as Stable Diffusion and DALL-E 3, hold significant potential in content creation and publishing workflows, making them the focus in recent years. Despite their remarkable capability to generate diverse and vivid images, considerable efforts are being made to prevent the generation of harmful content, such as abusive, violent, or pornographic material. To assess the safety of existing models, we introduce a novel jailbreaking method called Chain-of-Jailbreak (CoJ) attack, which compromises image generation models through a step-by-step editing process. Specifically, for malicious queries that cannot bypass the safeguards with a single prompt, we intentionally decompose the query into multiple sub-queries. The image generation models are then prompted to generate and iteratively edit images based on these sub-queries. To evaluate the effectiveness of our CoJ attack method, we constructed a comprehensive dataset, CoJ-Bench, including nine safety scenarios, three types of editing operations, and three editing elements. Experiments on four widely-used image generation services provided by GPT-4V, GPT-4o, Gemini 1.5 and Gemini 1.5 Pro, demonstrate that our CoJ attack method can successfully bypass the safeguards of models for over 60% cases, which significantly outperforms other jailbreaking methods (i.e., 14%). Further, to enhance these models’ safety against our CoJ attack method, we also propose an effective prompting-based method, Think-Twice Prompting, that can successfully defend over 95% of CoJ attack. Our dataset and code are included in the supplementary materials and will be made publicly available upon publication.
pdf
bib
abs
SOTOPIA-S4: a user-friendly system for flexible, customizable, and large-scale social simulation
Xuhui Zhou
|
Zhe Su
|
Sophie Feng
|
Jiaxu Zhou
|
Jen-tse Huang
|
Hsien-Te Kao
|
Spencer Lynch
|
Svitlana Volkova
|
Tongshuang Wu
|
Anita Woolley
|
Hao Zhu
|
Maarten Sap
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations)
Social simulation through large language model (LLM) agents is a promising approach to explore and validate social science hypotheses.We present SOTOPIA-S4, a fast, flexible, and scalable social simulation system that addresses the technical barriers of current frameworks while enabling practitioners to generate realistic, multi-turn and multi-party interactions with customizable evaluation metrics for hypothesis testing. SOTOPIA-S4 comes as a pip package that contains a simulation engine, an API server with flexible RESTful APIs for simulation management, and a web interface that enables both technical and non-technical users to design, run, and analyze simulations without programming. We demonstrate the usefulness of SOTOPIA-S4 with two use cases involving dyadic hiring negotiation scenarios and multi-party planning scenarios.
2024
pdf
bib
abs
InCharacter: Evaluating Personality Fidelity in Role-Playing Agents through Psychological Interviews
Xintao Wang
|
Yunze Xiao
|
Jen-tse Huang
|
Siyu Yuan
|
Rui Xu
|
Haoran Guo
|
Quan Tu
|
Yaying Fei
|
Ziang Leng
|
Wei Wang
|
Jiangjie Chen
|
Cheng Li
|
Yanghua Xiao
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Role-playing agents (RPAs), powered by large language models, have emerged as a flourishing field of applications. However, a key challenge lies in assessing whether RPAs accurately reproduce the personas of target characters, namely their character fidelity. Existing methods mainly focus on the knowledge and linguistic patterns of characters. This paper, instead, introduces a novel perspective to evaluate the personality fidelity of RPAs with psychological scales. Overcoming drawbacks of previous self-report assessments on RPAs, we propose InCharacter, namely **In**terviewing **Character** agents for personality tests. Experiments include various types of RPAs and LLMs, covering 32 distinct characters on 14 widely used psychological scales. The results validate the effectiveness of InCharacter in measuring RPA personalities. Then, with InCharacter, we show that state-of-the-art RPAs exhibit personalities highly aligned with the human-perceived personalities of the characters, achieving an accuracy up to 80.7%.
pdf
bib
abs
Not All Countries Celebrate Thanksgiving: On the Cultural Dominance in Large Language Models
Wenxuan Wang
|
Wenxiang Jiao
|
Jingyuan Huang
|
Ruyi Dai
|
Jen-tse Huang
|
Zhaopeng Tu
|
Michael Lyu
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
This paper identifies a cultural dominance issue within large language models (LLMs) due to the predominant use of English data in model training (e.g., ChatGPT). LLMs often provide inappropriate English-culture-related answers that are not relevant to the expected culture when users ask in non-English languages. To systematically evaluate the cultural dominance issue, we build a benchmark of concrete (e.g., holidays and songs) and abstract (e.g., values and opinions) cultural objects. Empirical results show that the representative GPT models suffer from the culture dominance problem, where GPT-4 is the most affected while text-davinci-003 suffers the least from this problem. Our study emphasizes the need to critically examine cultural dominance and ethical considerations in their development and deployment. We show that two straightforward methods in model development (i.e., pretraining on more diverse data) and deployment (e.g., culture-aware prompting) can significantly mitigate the cultural dominance issue in LLMs.
pdf
bib
abs
LogicAsker: Evaluating and Improving the Logical Reasoning Ability of Large Language Models
Yuxuan Wan
|
Wenxuan Wang
|
Yiliu Yang
|
Youliang Yuan
|
Jen-tse Huang
|
Pinjia He
|
Wenxiang Jiao
|
Michael Lyu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
We introduce LogicAsker, a novel approach for evaluating and enhancing the logical reasoning capabilities of large language models (LLMs) such as ChatGPT and GPT-4. Despite LLMs’ prowess in tasks like writing assistance, code generation, and machine translation, assessing their ability to reason has been challenging. Traditional evaluations often prioritize accuracy on downstream tasks over direct assessments of reasoning processes. LogicAsker addresses this gap by employing a set of atomic reasoning skills grounded in propositional and predicate logic to systematically examine and improve the reasoning prowess of LLMs. Our methodology reveals significant gaps in LLMs’ learning of logical rules, with identified reasoning failures ranging from 29% to 90% across different models. Moreover, we leverage these findings to construct targeted demonstration examples and fine-tune data, notably enhancing logical reasoning in models like GPT-4o by up to 5%. To our knowledge, this is the first effort to utilize test case outcomes to effectively refine LLMs’ formal reasoning capabilities. We make our code, data, and results publicly available(https://github.com/yxwan123/LogicAsker) to facilitate further research and replication of our findings.
pdf
bib
abs
On the Reliability of Psychological Scales on Large Language Models
Jen-tse Huang
|
Wenxiang Jiao
|
Man Ho Lam
|
Eric John Li
|
Wenxuan Wang
|
Michael Lyu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Recent research has focused on examining Large Language Models’ (LLMs) characteristics from a psychological standpoint, acknowledging the necessity of understanding their behavioral characteristics. The administration of personality tests to LLMs has emerged as a noteworthy area in this context. However, the suitability of employing psychological scales, initially devised for humans, on LLMs is a matter of ongoing debate. Our study aims to determine the reliability of applying personality assessments to LLMs, explicitly investigating whether LLMs demonstrate consistent personality traits. Analysis of 2,500 settings per model, including GPT-3.5, GPT-4, Gemini-Pro, and LLaMA-3.1, reveals that various LLMs show consistency in responses to the Big Five Inventory, indicating a satisfactory level of reliability. Furthermore, our research explores the potential of GPT-3.5 to emulate diverse personalities and represent various groups—a capability increasingly sought after in social sciences for substituting human participants with LLMs to reduce costs. Our findings reveal that LLMs have the potential to represent different personalities with specific prompt instructions.
pdf
bib
abs
InterIntent: Investigating Social Intelligence of LLMs via Intention Understanding in an Interactive Game Context
Ziyi Liu
|
Abhishek Anand
|
Pei Zhou
|
Jen-tse Huang
|
Jieyu Zhao
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Large language models (LLMs) have demonstrated the potential to mimic human social intelligence. However, most studies focus on simplistic and static self-report or performance-based tests, which limits the depth and validity of the analysis. In this paper, we developed a novel framework, InterIntent, to assess LLMs’ social intelligence by mapping their ability to understand and manage intentions in a game setting. We focus on four dimensions of social intelligence: situational awareness, self-regulation, self-awareness, and theory of mind. Each dimension is linked to a specific game task: intention selection, intention following, intention summarization, and intention guessing. Our findings indicate that while LLMs exhibit high proficiency in selecting intentions, achieving an accuracy of 88%, their ability to infer the intentions of others is significantly weaker, trailing human performance by 20%. Additionally, game performance correlates with intention understanding, highlighting the importance of the four components towards success in this game. These findings underline the crucial role of intention understanding in evaluating LLMs’ social intelligence and highlight the potential of using social deduction games as a complex testbed to enhance LLM evaluation. InterIntent contributes a structured approach to bridging the evaluation gap in social intelligence within multiplayer LLM-based games.
pdf
bib
abs
All Languages Matter: On the Multilingual Safety of LLMs
Wenxuan Wang
|
Zhaopeng Tu
|
Chang Chen
|
Youliang Yuan
|
Jen-tse Huang
|
Wenxiang Jiao
|
Michael Lyu
Findings of the Association for Computational Linguistics: ACL 2024
Safety lies at the core of developing and deploying large language models (LLMs). However, previous safety benchmarks only concern the safety in one language, e.g. the majority language in the pretraining data such as English. In this work, we build the first multilingual safety benchmark for LLMs, XSafety, in response to the global deployment of LLMs in practice. XSafety covers 14 kinds of commonly used safety issues across 10 languages that span several language families. We utilize XSafety to empirically study the multilingual safety for 4 widely-used LLMs, including both close-API and open-source models. Experimental results show that all LLMs produce significantly more unsafe responses for non-English queries than English ones, indicating the necessity of developing safety alignment for non-English languages. In addition, we propose a simple and effective prompting method to improve the multilingual safety of ChatGPT by enhancing cross-lingual generalization of safety alignment. Our prompting method can significantly reduce the ratio of unsafe responses by 42% for non-English queries. We will release all the data and results to facilitate future research on LLMs’ safety.
2023
pdf
bib
abs
ParroT: Translating during Chat using Large Language Models tuned with Human Translation and Feedback
Wenxiang Jiao
|
Jen-tse Huang
|
Wenxuan Wang
|
Zhiwei He
|
Tian Liang
|
Xing Wang
|
Shuming Shi
|
Zhaopeng Tu
Findings of the Association for Computational Linguistics: EMNLP 2023
Large language models (LLMs) like ChatGPT have exhibited remarkable abilities on a wide range of natural language processing (NLP) tasks, including various machine translation abilities accomplished during chat. However, these models are only accessible through restricted APIs, which creates barriers to new research and advancements in the field. Therefore, we propose ParroT, a framework to enhance and regulate the translation abilities during chat based on open-source LLMs (e.g., LLaMA), human-written translation and feedback data. Specifically, ParroT reformulates translation data into the instruction-following style, and introduces a “Hint” field for incorporating extra requirements to regulate the translation process. Accordingly, we propose three instruction types for finetuning ParroT models, including translation instruction, contrastive instruction, and error-guided instruction. Experiments on Flores subsets and WMT22 test sets suggest that translation instruction improves the translation performance of vanilla LLMs significantly while error-guided instruction can lead to further improvement, which demonstrates the importance of learning from low-quality translations annotated by humans. We also demonstrate the potential of automatic evaluation tools in providing quality information of translations, when constructing error-guided instructions for directions that lack human annotation data. Please refer to our Github project for more implementation details: https://github.com/wxjiao/ParroT.
2022
pdf
bib
abs
Tencent’s Multilingual Machine Translation System for WMT22 Large-Scale African Languages
Wenxiang Jiao
|
Zhaopeng Tu
|
Jiarui Li
|
Wenxuan Wang
|
Jen-tse Huang
|
Shuming Shi
Proceedings of the Seventh Conference on Machine Translation (WMT)
This paper describes Tencent’s multilingual machine translation systems for the WMT22 shared task on Large-Scale Machine Translation Evaluation for African Languages. We participated in the constrained translation track in which only the data and pretrained models provided by the organizer are allowed. The task is challenging due to three problems, including the absence of training data for some to-be-evaluated language pairs, the uneven optimization of language pairs caused by data imbalance, and the curse of multilinguality. To address these problems, we adopt data augmentation, distributionally robust optimization, and language family grouping, respectively, to develop our multilingual neural machine translation (MNMT) models. Our submissions won the 1st place on the blind test sets in terms of the automatic evaluation metrics. Codes, models, and detailed competition results are available at
https://github.com/wxjiao/WMT2022-Large-Scale-African.