Wenkai Li


2026

Large language models (LLMs) are rapidly being adopted for tasks like draftingemails, summarizing meetings, and answering health questions. In thesesettings, users may need to share private information (e.g., contactdetails, health records). To evaluate LLMs’ ability to identify and redactsuch information, prior work introduced real-life, scenario-based benchmarks(e.g., ConfAIde, PrivacyLens) and found that LLMs can leak privateinformation in complex scenarios. However, these evaluations relied on proxy LLMs to judge the helpfulnessand privacy-preservation quality of LLM responses, rather than directlymeasuring users’ perceptions. To understand how users perceive the helpfulness and privacy-preservationquality of LLM responses to privacy-sensitive scenarios, we conducted auser study (n=94) using 90 PrivacyLens scenarios. We found that users hadlow agreement with each other when evaluating identical LLM responses. Incontrast, five proxy LLMs reached high agreement, yet each proxy LLM hadlow correlation with users’ evaluations. These results indicate that proxy LLMs cannot accurately estimate users’ wide range of perceptions of utility and privacy inprivacy-sensitive scenarios. We discuss the need for more user-centeredstudies to measure LLMs’ ability to help users while preserving privacy,and for improving alignment between LLMs and users in estimating perceivedprivacy and utility.

2025

Addressing contextual privacy concerns remains challenging in interactive settings where large language models (LLMs) process information from multiple sources. Building on the theory of contextual integrity, we introduce a multi-agent framework that decomposes privacy reasoning into specialized subtasks—extraction, classification—reducing the information load on any single agent while enabling iterative validation and more reliable adherence to contextual privacy norms. Experiments on the ConfAIde benchmark with two LLMs (GPT-4, Llama3) demonstrate that our multi-agent system substantially reduces private information leakage (36% reduction) while maintaining the fidelity of public content compared to a single-agent system, showing the promise of multi-agent frameworks towards contextual privacy with LLMs.
The field of machine translation has achieved significant advancements, yet domain-specific terminology translation, particularly in AI, remains challenging. This work introduces GIST, a large-scale multilingual AI terminology dataset containing 5K terms extracted from top AI conference papers spanning 2000 to 2023. The terms were translated into Arabic, Chinese, French, Japanese, and Russian using a hybrid framework that combines LLMs for extraction with human expertise for translation. The dataset’s quality was benchmarked against existing resources, demonstrating superior translation accuracy through crowdsourced evaluation. GIST was integrated into translation workflows using post-translation refinement methods that required no retraining, where LLM prompting consistently improved BLEU and COMET scores. A web demonstration on the ACL Anthology platform highlights its practical application, showcasing improved accessibility for non-English speakers. We address a critical gap in AI terminology resources and fosters global inclusivity and collaboration in AI research.
In this work, we tackle the challenge of embedding realistic human personality traits into LLMs. Previous approaches have primarily focused on prompt-based methods that describe the behavior associated with the desired personality traits, suffering from realism and validity issues. To address these limitations, we introduce BIG5-CHAT, a large-scale dataset containing 100,000 dialogues designed to ground models in how humans express their personality in text. Leveraging this dataset, we explore Supervised Fine-Tuning and Direct Preference Optimization as training-based methods to align LLMs more naturally with human personality patterns. Our methods outperform prompting on personality assessments such as BFI and IPIP-NEO, with trait correlations more closely matching human data. Furthermore, our experiments reveal that models trained to exhibit higher conscientiousness, higher agreeableness, lower extraversion, and lower neuroticism display better performance on reasoning tasks, aligning with psychological findings on how these traits impact human cognitive performance. To our knowledge, this work is the first comprehensive study to demonstrate how training-based methods can shape LLM personalities through learning from real human behaviors.

2024

In an era of model and data proliferation in machine learning/AI especially marked by the rapid advancement of open-sourced technologies, there arises a critical need for standardized consistent documentation. Our work addresses the information incompleteness in current human-written model and data cards. We propose an automated generation approach using Large Language Models (LLMs). Our key contributions include the establishment of CardBench, a comprehensive dataset aggregated from over 4.8k model cards and 1.4k data cards, coupled with the development of the CardGen pipeline comprising a two-step retrieval process. Our approach exhibits enhanced completeness, objectivity, and faithfulness in generated model and data cards, a significant step in responsible AI documentation practices ensuring better accountability and traceability.