LongFaith: Enhancing Long-Context Reasoning in LLMs with Faithful Synthetic Data

Cehao Yang, Xueyuan Lin, Chengjin Xu, Xuhui Jiang, Shengjie Ma, Aofan Liu, Hui Xiong, Jian Guo


Abstract
Despite the rapid development of long-context large language models (LLMs), data-centric approaches that rely on synthetic data have been hindered by issues of faithfulness, which limit their effectiveness in enhancing model performance on tasks such as long-context reasoning and question answering (QA). These challenges are often exacerbated by misinformation arising from a lack of verification, reasoning without attribution, and potential knowledge conflicts. We propose LongFaith, a novel pipeline for synthesizing faithful long-context reasoning instruction datasets. By integrating ground truth and citation-based reasoning prompts, we eliminate distractions and improve the accuracy of reasoning chains, thereby obviating the need for costly verification processes. We open-source two synthesized datasets, LongFaith-SFT and LongFaith-PO, which systematically address multiple dimensions of faithfulness, including verified reasoning, attribution, and contextual grounding. Extensive experiments on multi-hop reasoning datasets and LongBench demonstrate that models fine-tuned on these datasets achieve significantly better performance. Our ablation studies highlight the scalability and adaptability of the LongFaith pipeline, showcasing its broad applicability in developing long-context LLMs.
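The abstract's central mechanism, conditioning the synthesis prompt on both the ground-truth answer and citation-marked passages, can be sketched roughly as below. This is a minimal illustration assuming a simple numbered-passage format; the template wording and the `build_synthesis_prompt` helper are hypothetical stand-ins, not the released LongFaith prompts.

```python
# Hypothetical sketch of a citation-based reasoning synthesis prompt, in the
# spirit of the pipeline described in the abstract. The template text, field
# names, and helper below are illustrative assumptions, not the paper's code.

CITATION_PROMPT = """\
You are given a question, a set of numbered passages, and the ground-truth
answer. Write a step-by-step reasoning chain that derives the answer, citing
the supporting passage(s) for every step as [i].

Question: {question}

Passages:
{passages}

Ground-truth answer: {answer}

Reasoning chain with citations:"""


def build_synthesis_prompt(question: str, passages: list[str], answer: str) -> str:
    """Format one long-context example into a citation-grounded synthesis prompt."""
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return CITATION_PROMPT.format(question=question, passages=numbered, answer=answer)


if __name__ == "__main__":
    # Toy two-hop example: the ground-truth answer anchors the reasoning chain,
    # and the numbered passages give the model explicit citation targets.
    prompt = build_synthesis_prompt(
        question="In which city is the university where the author of X studied?",
        passages=[
            "The author of X studied at the University of Vienna.",
            "The University of Vienna is located in Vienna, Austria.",
        ],
        answer="Vienna",
    )
    print(prompt)
```

Conditioning on the ground-truth answer in this way is what lets the pipeline skip a separate verification pass: the generated chain only needs to be checked for citation coverage, not for answer correctness.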
Anthology ID: 2025.findings-acl.169
Volume: Findings of the Association for Computational Linguistics: ACL 2025
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venues: Findings | WS
Publisher: Association for Computational Linguistics
Pages: 3236–3256
URL: https://preview.aclanthology.org/ingestion-acl-25/2025.findings-acl.169/
Cite (ACL): Cehao Yang, Xueyuan Lin, Chengjin Xu, Xuhui Jiang, Shengjie Ma, Aofan Liu, Hui Xiong, and Jian Guo. 2025. LongFaith: Enhancing Long-Context Reasoning in LLMs with Faithful Synthetic Data. In Findings of the Association for Computational Linguistics: ACL 2025, pages 3236–3256, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): LongFaith: Enhancing Long-Context Reasoning in LLMs with Faithful Synthetic Data (Yang et al., Findings 2025)
PDF: https://preview.aclanthology.org/ingestion-acl-25/2025.findings-acl.169.pdf