FactEval: Evaluating the Robustness of Fact Verification Systems in the Era of Large Language Models

Mamta Mamta, Oana Cocarascu


Abstract
Whilst large language models (LLMs) have made significant advances across natural language processing tasks, studies have shown that these models are vulnerable to small perturbations in their inputs, raising concerns about their robustness in the real world. Given the rise of misinformation online and its significant impact on society, fact verification is one area in which assessing the robustness of models developed for this task is crucial. However, the robustness of LLMs in fact verification remains largely unexplored. In this paper, we introduce FactEval, a novel large-scale benchmark for the extensive evaluation of LLMs in the fact verification domain, covering 17 realistic word-level and character-level perturbations and 4 types of subpopulations. We investigate the robustness of several LLMs under zero-shot, few-shot, and chain-of-thought prompting. Our analysis using FEVER, one of the largest and most widely-used datasets for fact verification, reveals that LLMs are brittle to small input changes and also exhibit performance variations across different subpopulations.
Anthology ID:
2025.naacl-long.534
Volume:
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month:
April
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Luis Chiruzzo, Alan Ritter, Lu Wang
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
10647–10660
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.naacl-long.534/
Cite (ACL):
Mamta Mamta and Oana Cocarascu. 2025. FactEval: Evaluating the Robustness of Fact Verification Systems in the Era of Large Language Models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 10647–10660, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
FactEval: Evaluating the Robustness of Fact Verification Systems in the Era of Large Language Models (Mamta & Cocarascu, NAACL 2025)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.naacl-long.534.pdf