On the Feasibility of In-Context Probing for Data Attribution

Cathy Jiao, Weizhen Gao, Aditi Raghunathan, Chenyan Xiong


Abstract
Data attribution methods measure the contribution of training data to model outputs and have important applications in areas such as dataset curation and model interpretability. However, many standard data attribution methods, such as influence functions, rely on model gradients and are computationally expensive. In this paper, we show that in-context probing (ICP), i.e., prompting an LLM, can serve as a fast proxy for gradient-based data attribution in data selection, under conditions that depend on data similarity. We study this connection empirically on standard NLP tasks and show that ICP and gradient-based data attribution are well correlated in identifying influential training data for tasks that share a similar task type and content with the training data. Additionally, fine-tuning models on influential data selected by either method achieves comparable downstream performance, further emphasizing their similarity. We then examine the connection between ICP and gradient-based data attribution using synthetic data on linear regression tasks. Our synthetic-data experiments show results similar to those from the NLP tasks, suggesting that this connection can be isolated in simpler settings, which offers a pathway to bridging the differences between the two methods.
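To make the linear regression comparison concrete, the following is a minimal sketch and not the authors' experimental setup. It assumes a first-order gradient similarity score as the gradient-based attribution method, and it replaces prompting an LLM with a hypothetical stand-in: one small gradient step on a single "in-context" example, scored by how much the test loss drops. All function names, the learning rate, and the data sizes are illustrative assumptions. Because the stand-in is itself a gradient step, the two scores agree to first order by construction, so the high rank correlation here illustrates the kind of agreement being measured rather than reproducing the paper's result.

```python
import numpy as np

def squared_loss(w, x, y):
    # Squared error of the linear model w on a single example (x, y).
    return 0.5 * (x @ w - y) ** 2

def loss_grad(w, x, y):
    # Gradient of the squared error with respect to w.
    return (x @ w - y) * x

def gradient_scores(w, X_train, y_train, x_test, y_test):
    # Assumed gradient-based attribution: dot product between each
    # training-example gradient and the test-example gradient.
    g_test = loss_grad(w, x_test, y_test)
    residuals = X_train @ w - y_train
    G_train = residuals[:, None] * X_train  # one gradient per row
    return G_train @ g_test

def probe_scores(w, X_train, y_train, x_test, y_test, lr=1e-4):
    # Hypothetical stand-in for in-context probing: adapt the model to one
    # example with a single small gradient step, then record the drop in
    # test loss. A real ICP setup would prompt an LLM instead.
    base = squared_loss(w, x_test, y_test)
    scores = np.empty(len(X_train))
    for i, (x, y) in enumerate(zip(X_train, y_train)):
        w_adapted = w - lr * loss_grad(w, x, y)
        scores[i] = base - squared_loss(w_adapted, x_test, y_test)
    return scores

def spearman(a, b):
    # Rank correlation via double argsort (ignores ties, which are
    # not expected for continuous scores).
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return np.corrcoef(rank_a, rank_b)[0, 1]

# Synthetic linear regression data (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
d, n = 5, 200
w_true = rng.normal(size=d)
X_train = rng.normal(size=(n, d))
y_train = X_train @ w_true + 0.1 * rng.normal(size=n)
x_test = rng.normal(size=d)
y_test = x_test @ w_true
w = np.zeros(d)  # untrained model used for scoring

g = gradient_scores(w, X_train, y_train, x_test, y_test)
p = probe_scores(w, X_train, y_train, x_test, y_test)
rho = spearman(g, p)
print(f"Spearman correlation between the two rankings: {rho:.3f}")
```

With a larger step size the second-order term in the loss change grows and the two rankings can start to diverge, which is one simple way to see why agreement between a probing-style score and a gradient score is conditional rather than guaranteed.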
Anthology ID:
2025.findings-naacl.286
Volume:
Findings of the Association for Computational Linguistics: NAACL 2025
Month:
April
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Luis Chiruzzo, Alan Ritter, Lu Wang
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
5140–5155
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.findings-naacl.286/
Cite (ACL):
Cathy Jiao, Weizhen Gao, Aditi Raghunathan, and Chenyan Xiong. 2025. On the Feasibility of In-Context Probing for Data Attribution. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 5140–5155, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
On the Feasibility of In-Context Probing for Data Attribution (Jiao et al., Findings 2025)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.findings-naacl.286.pdf