Han Fang


2025

pdf bib
Improving Model Factuality with Fine-grained Critique-based Evaluator
Yiqing Xie | Wenxuan Zhou | Pradyot Prakash | Di Jin | Yuning Mao | Quintin Fettes | Arya Talebzadeh | Sinong Wang | Han Fang | Carolyn Rose | Daniel Fried | Hejia Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Factuality evaluation aims to detect factual errors produced by language models (LMs) and hence guide the development of more factual models. Towards this goal, we train a factuality evaluator, FenCE, that provides LM generators with claim-level factuality feedback. In particular, we train FenCE to (1) generate textual critiques along with scores and (2) make claim-level judgment based on diverse source documents obtained by various tools, via data augmentation on a combination of public judgment datasets. We then present a framework that leverages FenCE to improve the factuality of LM generators by constructing training data. Specifically, we generate a set of candidate responses, ask FenCE to revise and score each response without introducing lesser-known facts, and train the generator by preferring highly scored revised responses. Experiments show that our data augmentation methods improve the evaluator’s accuracy by 2.9% on LLM-AggreFact. With FenCE, we improve Llama2-7B-chat/Llama3-8B-chat’s factuality rate by 16.86%/14.45% on FActScore, outperforming state-of-the-art factuality finetuning methods by 8.83%/6.96%.

pdf bib
Learning Auxiliary Tasks Improves Reference-Free Hallucination Detection in Open-Domain Long-Form Generation
Chengwei Qin | Wenxuan Zhou | Karthik Abinav Sankararaman | Nanshu Wang | Tengyu Xu | Alexander Radovic | Eryk Helenowski | Arya Talebzadeh | Aditya Tayade | Sinong Wang | Shafiq Joty | Han Fang | Hao Ma
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Hallucination, the generation of factually incorrect information, remains a significant challenge for large language models (LLMs), especially in open-domain long-form generation. Existing approaches for detecting hallucination in long-form tasks either focus on limited domains or rely heavily on external fact-checking tools, which may not always be available.In this work, we systematically investigate reference-free hallucination detection in open-domain long-form responses. Our findings reveal that internal states (e.g., model’s output probability and entropy) alone are insufficient for reliably (i.e., better than random guessing) distinguishing between factual and hallucinated content. To enhance detection, we explore various existing approaches, including prompting-based methods, probing, and fine-tuning, with fine-tuning proving the most effective. To further improve the accuracy, we introduce a new paradigm, named RATE-FT, that augments fine-tuning with an auxiliary task for the model to jointly learn with the main task of hallucination detection. With extensive experiments and analysis using a variety of model families & datasets, we demonstrate the effectiveness and generalizability of our method, e.g., +3% over general fine-tuning methods on LongFact.

pdf bib
Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback
Yen-Ting Lin | Di Jin | Tengyu Xu | Tianhao Wu | Sainbayar Sukhbaatar | Chen Zhu | Yun He | Yun-Nung Chen | Jason E Weston | Yuandong Tian | Arash Rahnama | Sinong Wang | Hao Ma | Han Fang
Proceedings of The 3rd Workshop on Mathematical Natural Language Processing (MathNLP 2025)

Large language models (LLMs) have recently demonstrated remarkable success in mathematical reasoning. Despite progress in methods like chain-of-thought prompting and self-consistency sampling, these advances often focus on final correctness without ensuring that the underlying reasoning process is coherent and reliable. This paper introduces Step-KTO, a training framework that combines process-level and outcome-level binary feedback to guide LLMs toward more trustworthy reasoning trajectories. By providing binary evaluations for both the intermediate reasoning steps and the final answer, Step-KTO encourages the model to adhere to logical progressions rather than relying on superficial shortcuts. Our experiments on challenging mathematical benchmarks show that Step-KTO significantly improves both final answer accuracy and the quality of intermediate reasoning steps. For example, on the MATH-500 dataset, Step-KTO achieves a notable improvement in Pass@1 accuracy over strong baselines. These results highlight the promise of integrating stepwise process feedback into LLM training, paving the way toward more interpretable and dependable reasoning capabilities.

2024

pdf bib
Effective Long-Context Scaling of Foundation Models
Wenhan Xiong | Jingyu Liu | Igor Molybog | Hejia Zhang | Prajjwal Bhargava | Rui Hou | Louis Martin | Rashi Rungta | Karthik Abinav Sankararaman | Barlas Oguz | Madian Khabsa | Han Fang | Yashar Mehdad | Sharan Narang | Kshitiz Malik | Angela Fan | Shruti Bhosale | Sergey Edunov | Mike Lewis | Sinong Wang | Hao Ma
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

We present an effective recipe to train strong long-context LLMs that are capable of utilizing massive context windows of up to 32,000 tokens. Our models are built through continual pretraining from Llama 2 checkpoints with longer text sequences and on a dataset where long texts are upsampled. We perform extensive evaluation using language modeling, synthetic context probing tasks, and a wide range of downstream benchmarks. Across all evaluations, our models achieve consistent improvements on most regular-context tasks and significant improvements on long-context tasks over Llama 2. Moreover, with a cost-effective instruction tuning procedure that is free of expensive annotation, the presented models can already surpass gpt-3.5-turbo-16k‘s overall performance on long-context benchmarks. Alongside these results, we provide an in-depth analysis on each individual component of our method. We delve into Llama’s position encodings and discuss its key limitation in modeling long data. We examine the impact of various design choices in the pretraining process, including the data mix and the training curriculum of sequence lengths – ablation results suggest that having abundant long texts in the pretrain dataset is not the key to achieving strong performance, and we empirically verify that long context continual pretraining is more efficient and similarly effective compared to pretraining from scratch with long sequences.