Tianhao Li

2025

As Large Language Models (LLMs) become more advanced, the security risks they pose also increase. Ensuring that LLM behavior aligns with human values, particularly in mitigating jailbreak attacks with elusive and implicit intentions, has become a significant challenge. To address this issue, we propose a jailbreak defense method called Real Intentions Defense (RID), which involves two phases: soft extraction and hard deletion. In the soft extraction phase, LLMs are leveraged to extract unbiased, genuine intentions, while in the hard deletion phase, a greedy gradient-based algorithm is used to remove the least important parts of a sentence, based on the insight that words with smaller gradients have less impact on its meaning. We conduct extensive experiments on Vicuna and Llama2 models using eight state-of-the-art jailbreak attacks and six benchmark datasets. Our results show a significant reduction in both Attack Success Rate (ASR) and Harmful Score of jailbreak attacks, while maintaining overall model performance. Further analysis sheds light on the underlying mechanisms of our approach.

Although Large Language Models (LLMs) succeed in human-guided conversations such as instruction following and question answering, the potential of LLM-guided conversations—where LLMs direct the discourse and steer the conversation’s objectives—remains under-explored. In this study, we first characterize LLM-guided conversation into three fundamental components: (i) Goal Navigation; (ii) Context Management; (iii) Empathetic Engagement, and propose GuideLLM as an installation. We then implement an interviewing environment for the evaluation of LLM-guided conversation. Specifically, various topics are involved in this environment for comprehensive interviewing evaluation, resulting in around 1.4k turns of utterances, 184k tokens, and over 200 events mentioned during the interviewing for each chatbot evaluation. We compare GuideLLM with 6 state-of-the-art LLMs such as GPT-4o and Llama-3-70b-Instruct, from the perspective of interviewing quality, and autobiography generation quality. For automatic evaluation, we derive user proxies from multiple autobiographies and employ LLM-as-a-judge to score LLM behaviors. We further conduct a human-involved experiment by employing 45 human participants to chat with GuideLLM and baselines. We then collect human feedback, preferences, and ratings regarding the qualities of conversation and autobiography. Experimental results indicate that GuideLLM significantly outperforms baseline LLMs in automatic evaluation and achieves consistent leading performances in human ratings.

2023

When consumers’ shopping needs are concentrated, they are more interested in the collection of commodities under the specific marketing theme. Therefore, mining marketing themes and their commodities collections can help customers save shopping costs and improve user clicks and purchases for recommendation system. However, the current system invites experts to write marketing themes and select the relevant commodities, which suffer from difficulty in mass production, poor timeliness and low online indicators. Therefore, we propose a automatic marketing theme and commodity construction system, which can not only generate popular marketing themes and select the relevant commodities automatically, but also improve the theme online effectiveness in the recommendation system. Specifically, we firstly utilize the pretrained language model to generate the marketing themes. And then, we utilize the theme-commodity consistency module to select the relevant commodities for the above generative theme. What’s more, we also build the indicator simulator to evaluate the effectiveness of the above generative theme. When the indicator is lower, the above selective commodities will be input into the theme-rewriter module to generate more efficient marketing themes. Finally, we utilize the human screening to control the system quality. Both the offline experiments and online A/B test demonstrate the superior performance of our proposed system compared with state-of-the-art methods.