CoLA: A Choice Leakage Attack Framework to Expose Privacy Risks in Subset Training

Qi Li; Cheng-Long Wang; Yinzhi Cao; Di Wang

CoLA: A Choice Leakage Attack Framework to Expose Privacy Risks in Subset Training

Qi Li, Cheng-Long Wang, Yinzhi Cao, Di Wang

Abstract

Training models on a carefully chosen portion of data rather than the full dataset is now a standard preprocess for modern ML. From vision coreset selection to large-scale filtering in language models, it enables scalability with minimal utility loss. A common intuition is that training on fewer samples should also reduce privacy risks. In this paper, we challenge this assumption. We show that subset training is not privacy free: the very choices of which data are included or excluded can introduce new privacy surface and leak more sensitive information. Such information can be captured by adversaries either through side-channel metadata from the subset selection process or via the outputs of the target model. To systematically study this phenomenon, we propose CoLA (Choice Leakage Attack), a unified framework for analyzing privacy leakage in subset selection. In CoLA, depending on the adversary’s knowledge of the side-channel information, we define two practical attack scenarios: Subset-aware Side-channel Attacks and Black-box Attacks. Under both scenarios, we investigate two privacy surfaces unique to subset training: (1) Training-membership MIA (TM-MIA), which concerns only the privacy of training data membership, and (2) Selection-participation MIA (SP-MIA), which concerns the privacy of all samples that participated in the subset selection process. Notably, SP-MIA enlarges the notion of membership from model training to the entire data-model supply chain. Experiments on vision and language models show that existing threat models underestimate subset-training privacy risks: the expanded privacy surface leaks both training and selection membership, extending risks from individual models to the broader ML ecosystem.

Anthology ID:: 2026.acl-long.733
Volume:: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:: July
Year:: 2026
Address:: San Diego, California, United States
Editors:: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:: ACL
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 16137–16149
Language:
URL:: https://preview.aclanthology.org/ingest-acl/2026.acl-long.733/
DOI:
Bibkey:
Cite (ACL):: Qi Li, Cheng-Long Wang, Yinzhi Cao, and Di Wang. 2026. CoLA: A Choice Leakage Attack Framework to Expose Privacy Risks in Subset Training. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16137–16149, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):: CoLA: A Choice Leakage Attack Framework to Expose Privacy Risks in Subset Training (Li et al., ACL 2026)
Copy Citation:
PDF:: https://preview.aclanthology.org/ingest-acl/2026.acl-long.733.pdf
Checklist:: 2026.acl-long.733.checklist.pdf

PDF Cite Search Checklist Fix data