Florian Jacob


2025

pdf bib
Can a Large Language Model Keep My Secrets? A Study on LLM-Controlled Agents
Niklas Hemken | Sai Koneru | Florian Jacob | Hannes Hartenstein | Jan Niehues
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

Agents controlled by Large Language Models (LLMs) can assist with natural language tasks across domains and applications when given access to confidential data.When such digital assistants interact with their potentially adversarial environment, confidentiality of the data is at stake.We investigated whether an LLM-controlled agent can, in a manner similar to humans, consider confidentiality when responding to natural language requests involving internal data.For evaluation, we created a synthetic dataset consisting of confidentiality-aware planning and deduction tasks in organizational access control.The dataset was developed from human input, LLM-generated content, and existing datasets.It includes various everyday scenarios in which access to confidential or private information is requested.We utilized our dataset to evaluate the ability to infer confidentiality-aware behavior in such scenarios by differentiating between legitimate and illegitimate access requests.We compared a prompting-based and a fine-tuning-based approach, to evaluate the performance of Llama 3 and GPT-4o-mini in this domain.In addition, we conducted a user study to establish a baseline for human evaluation performance in these tasks. We found humans reached an accuracy of up to 79%.Prompting techniques, such as chain-of-thought and few-shot prompting, yielded promising results, but still fell short of real-world applicability and do not surpass human baseline performance. However, we found that fine-tuning significantly improves the agent’s access decisions, reaching up to 98% accuracy, making it promising for future confidentiality-aware applications when data is available.