Lujo Bauer


2026

Large language models (LLMs) are rapidly being adopted for tasks like drafting emails, summarizing meetings, and answering health questions. In these settings, users may need to share private information (e.g., contact details, health records). To evaluate LLMs’ ability to identify and redact such information, prior work introduced real-life, scenario-based benchmarks (e.g., ConfAIde, PrivacyLens) and found that LLMs can leak private information in complex scenarios. However, these evaluations relied on proxy LLMs to judge the helpfulness and privacy-preservation quality of LLM responses, rather than directly measuring users’ perceptions. To understand how users perceive the helpfulness and privacy-preservation quality of LLM responses to privacy-sensitive scenarios, we conducted a user study (n=94) using 90 PrivacyLens scenarios. We found that users had low agreement with each other when evaluating identical LLM responses. In contrast, five proxy LLMs reached high agreement, yet each proxy LLM had low correlation with users’ evaluations. These results indicate that proxy LLMs cannot accurately estimate users’ wide range of perceptions of utility and privacy in privacy-sensitive scenarios. We discuss the need for more user-centered studies to measure LLMs’ ability to help users while preserving privacy, and for improving alignment between LLMs and users in estimating perceived privacy and utility.

2025

Large language models (LLMs) are prone to hallucinations and sensitive to prompt perturbations, often resulting in inconsistent or unreliable generated text. Different methods have been proposed to mitigate such hallucinations and fragility, one of which is to measure the consistency of LLM responses, that is, the model’s confidence in the response or the likelihood of generating a similar response when resampled. In previous work, measuring LLM response consistency often relied on calculating the probability of a response appearing within a pool of resampled responses, analyzing internal states, or evaluating logits of responses. However, it was not clear how well these approaches approximated users’ perceptions of consistency of LLM responses. To find out, we performed a user study (n=2,976) demonstrating that current methods for measuring LLM response consistency typically do not align well with humans’ perceptions of LLM consistency. We propose a logit-based ensemble method for estimating LLM consistency and show that our method matches the performance of the best-performing existing metric in estimating human ratings of LLM consistency. Our results suggest that methods for estimating LLM consistency without human evaluation are sufficiently imperfect to warrant broader use of evaluation with human input; this would avoid misjudging the adequacy of models because of the imperfections of automated consistency metrics.