Where Privacy Risk Lives in English-Source Multilingual RAG: A Stage-Decomposed Audit Across Five Query Languages

Yanhang Li, Zhichao Fan, Zexin Zhuang


Abstract
A common assumption holds that switching to a non-English query language makes a multilingual RAG system easier to attack for personal information. We test this on an English-source synthetic-PII corpus with five query languages and a two-stage defence (an LLM input judge followed by a regex output filter). The output-stage point estimates do not support the assumption: English has the highest observed unstructured-PII leak rate, and only the English-vs-Swahili comparison separates cleanly under our document-level bootstrap intervals. Once the input judge is added, residual leaks remain on Arabic and Swahili in this Qwen-mediated pipeline, and back-translating the query does not close the gap. Because the translator, judge, and generator share one model family, we treat this result as pipeline-conditional rather than as a causal language ranking. As an oracle diagnostic on a separate n=17 multilingual-prompted-judge residual corner, attaching the gold corpus document to the input judge blocks 15/17 residual cells. We present this as a follow-up direction, not a deployed mitigation: all BLOCK/ALLOW rates are measured on adversarial queries only, and we measure no benign-query false-positive rate (FPR) or utility. The anonymous supplement contains code, corpora, queries, and per-trial JSONLs.
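The abstract refers to document-level bootstrap intervals on leak rates. As a minimal sketch of what such an interval could look like, the Python below resamples whole documents with replacement, so that trials on the same document stay together, and reports a percentile interval. The input layout (pairs of a document ID and a leak flag), the function name doc_bootstrap_ci, the percentile method, and the defaults for n_boot and alpha are all assumptions made for illustration; the abstract does not specify the authors' exact procedure.

```python
import random
from collections import defaultdict


def doc_bootstrap_ci(trials, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap over documents (hypothetical helper).

    trials: iterable of (doc_id, leaked) pairs, one per trial.
    Returns (point_estimate, lower, upper) for the leak rate.
    """
    # Group trial outcomes by document so resampling keeps them together.
    by_doc = defaultdict(list)
    for doc_id, leaked in trials:
        by_doc[doc_id].append(int(leaked))
    docs = list(by_doc)
    rng = random.Random(seed)

    def rate(doc_ids):
        leaks = sum(sum(by_doc[d]) for d in doc_ids)
        total = sum(len(by_doc[d]) for d in doc_ids)
        return leaks / total

    point = rate(docs)
    # Each replicate draws len(docs) documents with replacement.
    stats = sorted(
        rate([rng.choice(docs) for _ in docs]) for _ in range(n_boot)
    )
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return point, lower, upper
```

Under this sketch, two languages would "separate cleanly" when their intervals do not overlap. Resampling at the document level rather than the trial level is the design choice that matters here: it avoids treating repeated queries against the same document as independent observations.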
Anthology ID: 2026.mellm-1.28
Volume: Proceedings of the 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM 2026)
Month: July
Year: 2026
Address: San Diego, United States
Editors: Kaiyu Huang, Fengran Mo, Pinzhen Chen, Meng Jiang
Venues: MeLLM | WS
Publisher: Association for Computational Linguistics
Pages: 284–293
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.mellm-1.28/
Cite (ACL): Yanhang Li, Zhichao Fan, and Zexin Zhuang. 2026. Where Privacy Risk Lives in English-Source Multilingual RAG: A Stage-Decomposed Audit Across Five Query Languages. In Proceedings of the 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM 2026), pages 284–293, San Diego, United States. Association for Computational Linguistics.
Cite (Informal): Where Privacy Risk Lives in English-Source Multilingual RAG: A Stage-Decomposed Audit Across Five Query Languages (Li et al., MeLLM 2026)
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.mellm-1.28.pdf