EULoInf: Efficient Hessian-Free Entropy Based Uncertainty-Aware Data Influence Approximation

Runxin Cai; Jingtan Wang; Bryan Kian Hsiang Low

Abstract

In Large Language Model (LLM) post-training, fine-tuning on high-quality data effectively improves model performance, which highlights the need to identify high-quality, beneficial fine-tuning data. However, one of the most popular data valuation paradigms, the influence function and its variants, is computationally expensive because it relies on inverse Hessian-vector product (iHVP) computations that scale poorly with model size. To examine whether influence values correlate with efficiently computable intrinsic features, we empirically investigate the distribution of the most influential data during fine-tuning and observe that highly influential data tend to have high predictive uncertainty. Yet such highly uncertain samples have a dual nature: they can be either beneficial data or detrimental noisy data. Unlike traditional methods that treat uncertainty as a standalone criterion, we introduce a directional indicator to rigorously disentangle these opposing effects. Formally, we propose EULoInf (Entropy-based Uncertainty-aware Lookahead Influence), a computationally efficient valuation framework. By approximating influence via uncertainty and a gradient-based validation-loss lookahead, EULoInf avoids iHVP computation, reducing the iHVP-induced quadratic complexity in the number of model parameters to linear time. We rigorously derive our framework from the influence function. Empirically, it matches or even outperforms prior methods across diverse data valuation tasks, including mislabel detection and data selection, and across LLM architectures, while reducing computational time and memory usage by over 50%.
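The two ingredients named in the abstract, predictive uncertainty and a directional indicator based on a gradient lookahead of the validation loss, can be illustrated with a minimal sketch. This is not the paper's formula: the function names (`predictive_entropy`, `lookahead_direction`, `toy_score`) are hypothetical, and combining the two quantities as sign times entropy is an assumption made purely for illustration. The sketch assumes a single SGD step on the training sample, so the first-order change in validation loss is proportional to the negative dot product of the training and validation gradients.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predictive_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over the logits."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def lookahead_direction(train_grad, val_grad):
    """Hypothetical directional indicator (an assumption, not the paper's definition).

    After one SGD step on the training sample, theta' = theta - lr * train_grad,
    the first-order change in validation loss is -lr * <val_grad, train_grad>.
    Returns +1 if the step is expected to lower validation loss,
    -1 if it is expected to raise it, and 0 if the gradients are orthogonal.
    """
    dot = sum(g_t * g_v for g_t, g_v in zip(train_grad, val_grad))
    return (dot > 0) - (dot < 0)


def toy_score(logits, train_grad, val_grad):
    """Illustrative score: uncertainty magnitude signed by the lookahead direction."""
    return lookahead_direction(train_grad, val_grad) * predictive_entropy(logits)


# A maximally uncertain sample whose gradient opposes the validation gradient
# receives a negative score; an aligned one receives a positive score.
harmful = toy_score([0.0, 0.0], [1.0, 2.0], [-1.0, -2.0])
helpful = toy_score([0.0, 0.0], [1.0, 2.0], [1.0, 2.0])
```

Nothing in this sketch forms or inverts a Hessian; it needs only per-sample logits and two gradient vectors, which is the kind of cost profile the abstract contrasts with iHVP-based influence estimation.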

Anthology ID: 2026.findings-acl.1839
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 36911–36928
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1839/
Cite (ACL): Runxin Cai, Jingtan Wang, and Bryan Kian Hsiang Low. 2026. EULoInf: Efficient Hessian-Free Entropy Based Uncertainty-Aware Data Influence Approximation. In Findings of the Association for Computational Linguistics: ACL 2026, pages 36911–36928, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): EULoInf: Efficient Hessian-Free Entropy Based Uncertainty-Aware Data Influence Approximation (Cai et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1839.pdf
Checklist: 2026.findings-acl.1839.checklist.pdf
