Hierarchical Retrieval with Evidence Curation for Open-Domain Financial Question Answering on Standardized Documents

Jaeyoung Choe, Jihoon Kim, Woohwan Jung


Abstract
Retrieval-augmented generation (RAG) based large language models (LLMs) are widely used in finance for their excellent performance on knowledge-intensive tasks. However, standardized documents (e.g., SEC filings) share similar formats, such as repetitive boilerplate text and similar table structures. This similarity causes traditional RAG methods to misidentify near-duplicate text, leading to duplicate retrieval that undermines accuracy and completeness. To address these issues, we propose the Hierarchical Retrieval with Evidence Curation (HiREC) framework. Our approach performs hierarchical retrieval to reduce confusion among similar texts: it first retrieves related documents and then selects the most relevant passages from those documents. The evidence curation process then removes irrelevant passages and, when necessary, automatically generates complementary queries to collect missing information. To evaluate our approach, we construct and release a Large-scale Open-domain Financial (LOFin) question answering benchmark that includes 145,897 SEC documents and 1,595 question-answer pairs. Our code and data are available at https://github.com/deep-over/LOFin-bench-HiREC.
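The two-stage idea in the abstract (retrieve documents first, then select passages only within them, then filter out weak evidence) can be illustrated with a minimal sketch. This is not the authors' implementation: the token-overlap score, the per-document `summary` field, the function names, the thresholds, and the toy corpus are all assumptions made for illustration, and the LLM-based generation of complementary queries is left out.

```python
from collections import Counter


def overlap_score(query, text):
    """Toy relevance score (an assumption): count of shared lowercase tokens."""
    q = Counter(query.lower().split())
    t = Counter(text.lower().split())
    return sum((q & t).values())


def hierarchical_retrieve(query, corpus, top_docs=1, top_passages=2):
    """Stage 1 ranks documents; stage 2 ranks passages inside the chosen documents.

    corpus maps a document id to {"summary": str, "passages": [str, ...]}.
    """
    # Stage 1: pick the documents whose summary best matches the query.
    doc_ids = sorted(
        corpus,
        key=lambda d: overlap_score(query, corpus[d]["summary"]),
        reverse=True,
    )[:top_docs]
    # Stage 2: score passages only from the selected documents, so identical
    # boilerplate in other filings never enters the candidate pool.
    candidates = [(d, p) for d in doc_ids for p in corpus[d]["passages"]]
    candidates.sort(key=lambda dp: overlap_score(query, dp[1]), reverse=True)
    return candidates[:top_passages]


def curate(query, passages, min_score=2):
    """Stand-in for evidence curation: drop passages scoring below a threshold."""
    return [(d, p) for d, p in passages if overlap_score(query, p) >= min_score]


# Hypothetical toy corpus: two filings sharing the same boilerplate passage.
corpus = {
    "company_a_10k": {
        "summary": "company_a 10-k 2023 annual report",
        "passages": [
            "total net sales were 100 million",
            "forward-looking statements involve risks",
        ],
    },
    "company_b_10k": {
        "summary": "company_b 10-k 2023 annual report",
        "passages": [
            "total revenue was 200 million",
            "forward-looking statements involve risks",
        ],
    },
}

query = "company_a 2023 total net sales"
hits = hierarchical_retrieve(query, corpus)
kept = curate(query, hits)
```

In this toy setup, the shared boilerplate passage from the second filing is never considered because its document is excluded in stage 1, and the boilerplate passage from the selected filing is dropped by the curation threshold.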
Anthology ID:
2025.findings-acl.855
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
16663–16681
URL:
https://preview.aclanthology.org/mtsummit-25-ingestion/2025.findings-acl.855/
DOI:
10.18653/v1/2025.findings-acl.855
Cite (ACL):
Jaeyoung Choe, Jihoon Kim, and Woohwan Jung. 2025. Hierarchical Retrieval with Evidence Curation for Open-Domain Financial Question Answering on Standardized Documents. In Findings of the Association for Computational Linguistics: ACL 2025, pages 16663–16681, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Hierarchical Retrieval with Evidence Curation for Open-Domain Financial Question Answering on Standardized Documents (Choe et al., Findings 2025)
PDF:
https://preview.aclanthology.org/mtsummit-25-ingestion/2025.findings-acl.855.pdf