Yubo Zhou
2026
Are Large Language Models Reliable Reviewers? A Benchmark for Error Detection in Financial Documents
Ying He | Zhouhong Gu | Zhecheng Hu | Yubo Zhou | Hao Shen | Jiaqing Liang | Zhaoqian Dai | Ma Shuguang | Fei Yu | Yanghua Xiao | Zhixu Li
Findings of the Association for Computational Linguistics: ACL 2026
Ensuring the accuracy of financial documents is critical for economic analysis, regulatory compliance, and corporate decision-making. Several studies have shown that Large Language Models (LLMs) perform well on many financial tasks, such as stock price movement prediction and financial analytics. However, a critical task remains unexplored: the ability of LLMs to identify errors in financial documents. In this paper, we introduce **FinED-Bench**, the first publicly available benchmark for Financial Error Detection across three levels of cognitive complexity. FinED-Bench covers nine real-world financial scenarios and includes over 900 documents from 2025 that existing language models have not seen. We detail the benchmark construction process and evaluate several advanced LLMs (e.g., GPT-4o, Qwen3-14B) on this task, which requires both financial domain knowledge and reasoning capabilities. Experimental results show that current LLMs still struggle with this task, especially in high-complexity cases. In addition, supervised fine-tuning can significantly improve the performance of weaker LLMs on this task. Our data and code are available at https://anonymous.4open.science/r/FinED-Bench-406F.