FASTTRACK: Reliable Fact Tracing via Clustering and LLM-Powered Evidence Validation

Si Chen, Feiyang Kang, Ning Yu, Ruoxi Jia


Abstract
Fact tracing seeks to identify the specific training examples that serve as the knowledge source for a given query. Existing approaches assess the similarity between each training sample and the query along a single dimension, such as lexical similarity, gradients, or embedding space. However, these methods fail to distinguish samples that are merely relevant from those that actually provide supportive evidence for the information sought by the query, which often results in suboptimal effectiveness. Moreover, they must compute the similarity of every individual training point for each query, imposing significant computational demands and creating a substantial barrier to practical application. This paper introduces FASTTRACK, a novel approach that harnesses Large Language Models (LLMs) to validate supportive evidence for queries while clustering the training database to narrow the scope over which the LLMs must trace facts. Our experiments show that FASTTRACK substantially outperforms existing methods in both accuracy and efficiency, achieving more than a 100% improvement in F1 score over state-of-the-art methods while being 33× faster than TracIn.
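The two ideas named in the abstract, narrowing the search by clustering and then checking candidates for supportive evidence, can be illustrated with a toy sketch. This is not the authors' implementation: the keyword-based clustering, the `stub_validator` function (a stand-in for an LLM call), and all names and data below are hypothetical and chosen only to show the difference between a relevant sample and a supportive one.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on whitespace after stripping simple punctuation."""
    return set(text.lower().replace(".", " ").replace("?", " ").split())

def cluster_by_keyword(corpus, keywords):
    """Toy offline step (assumption): assign each sample to the first keyword it mentions."""
    clusters = defaultdict(list)
    for idx, text in enumerate(corpus):
        tokens = tokenize(text)
        label = next((k for k in keywords if k in tokens), "other")
        clusters[label].append(idx)
    return dict(clusters)

def stub_validator(query, answer, sample):
    """Hypothetical placeholder for an LLM judgment: does the sample state the answer?"""
    return answer.lower() in sample.lower()

def trace_fact(query, answer, corpus, clusters, validator):
    """Validate only the samples in clusters whose label appears in the query."""
    query_tokens = tokenize(query)
    candidates = [i for label, ids in clusters.items() if label in query_tokens for i in ids]
    return [i for i in candidates if validator(query, answer, corpus[i])]

corpus = [
    "Paris is the capital of France.",   # supportive
    "France is famous for its cheese.",  # relevant, not supportive
    "Berlin is the capital of Germany.", # different cluster, never validated
    "Germany borders France.",           # relevant, not supportive
]
clusters = cluster_by_keyword(corpus, ["france", "germany"])
supporting = trace_fact("What is the capital of France?", "Paris",
                        corpus, clusters, stub_validator)
print(supporting)
```

In this sketch only three of the four samples reach the validator, and only the sample that states the answer is returned; the other two mention France but do not support the queried fact.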
Anthology ID:
2024.findings-emnlp.334
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
5821–5836
URL:
https://preview.aclanthology.org/build-pipeline-with-new-library/2024.findings-emnlp.334/
DOI:
10.18653/v1/2024.findings-emnlp.334
Cite (ACL):
Si Chen, Feiyang Kang, Ning Yu, and Ruoxi Jia. 2024. FASTTRACK: Reliable Fact Tracing via Clustering and LLM-Powered Evidence Validation. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 5821–5836, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
FASTTRACK: Reliable Fact Tracing via Clustering and LLM-Powered Evidence Validation (Chen et al., Findings 2024)
PDF:
https://preview.aclanthology.org/build-pipeline-with-new-library/2024.findings-emnlp.334.pdf
Software:
 2024.findings-emnlp.334.software.zip
Data:
 2024.findings-emnlp.334.data.zip