How Do LLMs "Trust" Unknown Knowledge? An Unknown Knowledge Based Jailbreak Attack

Yixiao Huang; Lan Zhang; Chaoran Wang

Abstract

Learning unknown knowledge through in-context learning (ICL) and retrieval-augmented generation (RAG) can enhance LLM capabilities in specialized fields. While most research focuses on how to identify and utilize such knowledge, little work examines which factors lead LLMs to trust and adopt it, leaving models prone to errors and harmful content. Building on extensive preliminary experiments, we design five pairs of trust-enhancing and trust-diminishing transformations on unknown knowledge to identify the key trust factors experimentally. We further substantiate these findings with a theoretical analysis grounded in the epistemological framework of evidentialism. Based on these insights, we propose a completely unrestricted and fully randomized jailbreak attack that embeds malicious queries within trust-enhanced unknown knowledge. In both defended and undefended scenarios, our method achieves an attack success rate (ASR) of 99% to 100% on all tested LLMs, including the latest GPT-5.1, setting a new state of the art. This attack confirms the trust mechanism and exposes a critical, hard-to-defend security risk. Our conclusions provide guidance for understanding the trust mechanism of unknown knowledge and for future research.

Anthology ID: 2026.findings-acl.1849
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 37105–37124
URL: https://preview.aclanthology.org/ingestion-form-platform/2026.findings-acl.1849/
Cite (ACL): Yixiao Huang, Lan Zhang, and Chaoran Wang. 2026. How Do LLMs "Trust" Unknown Knowledge? An Unknown Knowledge Based Jailbreak Attack. In Findings of the Association for Computational Linguistics: ACL 2026, pages 37105–37124, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): How Do LLMs “Trust” Unknown Knowledge? An Unknown Knowledge Based Jailbreak Attack (Huang et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingestion-form-platform/2026.findings-acl.1849.pdf
Checklist: 2026.findings-acl.1849.checklist.pdf
