Xiao Zhu


2025

FaStFact: Faster, Stronger Long-Form Factuality Evaluations in LLMs
Yingjia Wan | Haochen Tan | Xiao Zhu | Xinyu Zhou | Zhiwei Li | Qingsong Lv | Changxuan Sun | Jiaqi Zeng | Yi Xu | Jianqiao Lu | Yinhong Liu | Zhijiang Guo
Findings of the Association for Computational Linguistics: EMNLP 2025

Evaluating the factuality of long-form generations from Large Language Models (LLMs) remains challenging due to accuracy issues and the cost of human assessment. Prior evaluation pipelines attempt this by decomposing text into claims, searching for evidence, and verifying claims, but suffer from critical drawbacks: (1) inefficiency, because complex pipeline components are unsuited to long LLM outputs, and (2) ineffectiveness, stemming from inaccurate claim sets and insufficient evidence collected from one-line SERP snippets. To address these limitations, we adapt the existing decompose-then-verify evaluation framework and propose FaStFact, a fast and strong evaluation pipeline that achieves the highest alignment with human evaluation and the greatest efficiency among existing baselines. FaStFact first employs chunk-level claim extraction integrated with confidence-based pre-verification, significantly reducing the cost of web search and inference calls while preserving reliability. For searching and verification, it gathers document-level evidence from crawled website pages for retrieval during verification, addressing the evidence insufficiency of previous pipelines. Extensive experiments on an aggregated and manually annotated benchmark demonstrate that FaStFact evaluates the factuality of long-form LLM generations both efficiently and effectively. Code and benchmark are submitted with the paper and will be made publicly available to facilitate research.
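
The abstract outlines a three-stage control flow: chunk-level claim extraction with confidence-based pre-verification, document-level evidence gathering, and final claim verification. Below is a minimal runnable sketch of that flow; every function name, threshold, and data structure is an illustrative assumption, and the LLM and crawling steps are replaced by stubs, so this is not the authors' released implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    text: str
    confidence: float                 # extractor's self-reported confidence
    verified: bool = False
    evidence: list[str] = field(default_factory=list)

def chunk(text: str, max_chars: int = 2000) -> list[str]:
    # Chunk-level processing: extraction sees local context rather than
    # the whole long output at once (chunk size is an assumption).
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def extract_claims(chunk_text: str) -> list[Claim]:
    # Stub: a real pipeline would use one LLM call per chunk to decompose
    # it into atomic claims and score confidence in each.
    return [Claim(s.strip() + ".", confidence=0.5)
            for s in chunk_text.split(".") if s.strip()]

def gather_document_evidence(claim: Claim) -> list[str]:
    # Stub: crawl full result pages (not one-line SERP snippets) and
    # retrieve the passages most relevant to the claim.
    return []

def verify(claim: Claim) -> bool:
    # Stub: final LLM judgment of the claim against retrieved evidence.
    return bool(claim.evidence)

def factuality_score(generation: str, threshold: float = 0.9) -> float:
    # Returns the fraction of supported claims as a factuality score.
    claims = [c for ch in chunk(generation) for c in extract_claims(ch)]
    for c in claims:
        if c.confidence >= threshold:
            c.verified = True          # pre-verified: no search needed
        else:
            c.evidence = gather_document_evidence(c)
            c.verified = verify(c)
    return sum(c.verified for c in claims) / max(len(claims), 1)
```

Under this sketch, raising the (hypothetical) threshold pushes more claims into search-based verification; the confidence-based pre-verification step described in the abstract exists precisely to keep that expensive set small.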