Chiwei Zhu


2025

pdf bib
From Real to Synthetic: Synthesizing Millions of Diversified and Complicated User Instructions with Attributed Grounding
Chiwei Zhu | Benfeng Xu | Xiaorui Wang | Zhendong Mao
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The pursuit of diverse, complex, and large-scale instruction data is crucial for automatically aligning large language models (LLMs). While there are methods capable of generating synthetic instructions at scale, they either suffer from limited grounding sources, leading to a narrow distribution, or rely on trivial extensions that fail to produce meaningful trajectories in terms of complexity. In contrast, instructions that benefit efficient alignment are typically crafted with cognitive insights and grounded in real-world use cases. In this paper, we synthesize such instructions using attributed grounding, which involves 1) a top-down attribution process that grounds a selective set of real instructions to situated users, and 2) a bottom-up synthesis process that leverages web documents to first generate a situation, then a meaningful instruction. This framework allows us to harvest diverse and complex instructions at scale, utilizing the vast range of web documents. Specifically, we construct a dataset of 1 million instructions, called SynthQuestions, and demonstrate that models trained on it achieve leading performance on several common benchmarks, with improvements that continually scale with more web corpora.

pdf bib
Rationales Are Not Silver Bullets: Measuring the Impact of Rationales on Model Performance and Reliability
Chiwei Zhu | Benfeng Xu | An Yang | Junyang Lin | Quan Wang | Chang Zhou | Zhendong Mao
Findings of the Association for Computational Linguistics: ACL 2025

Training language models with rationales augmentation has been shown to be beneficial in many existing works. In this paper, we identify that such a prevailing view does not hold consistently. We conduct comprehensive investigations to thoroughly inspect the impact of rationales on model performance as well as a novel perspective of model reliability. The results lead to several key findings that add new insights upon existing understandings: 1) Rationales can, at times, deteriorate model performance; 2) Rationales can, at times, improve model reliability, even outperforming their untrained counterparts; 3) A linear correspondence exists in between the performance and reliability improvements, while both are driven by the intrinsic difficulty of the task. These findings provide informative regulations on the broad utilization of rationales and raise critical implications on the procedure of explicitly aligning language models with implicit human thoughts. Codes can be found in this anonymous link: https://anonymous.4open.science/r/rationales-CEE8.

2024

pdf bib
KNN-Instruct: Automatic Instruction Construction with K Nearest Neighbor Deduction
Jianshang Kou | Benfeng Xu | Chiwei Zhu | Zhendong Mao
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Supervised fine-tuning (SFT) is a critical procedure for aligning large language models. Despite its efficiency, the construction of SFT data often struggles with issues of quality, diversity, and scalability. Many existing methods, inspired by the Self-Instruct framework, typically generate synthetic instructions by prompting aligned proprietary models like ChatGPT. However, such process suffers from stale distribution, resulting in instructions that are merely trivial variations of existing ones. In this paper, we introduce a novel bootstrapping approach termed KNN-Instruct, which incorporates KNN deduction to produce meaningful new instructions by effectively summarizing and learning from similar existing ones. We conduct an economical controlled experiment to preliminarily validate its effectiveness. In the further experiment, we construct a high-quality SFT dataset named KNN-Inst-12k*. Applying the dataset to Qwen-2-7B, we get a MT-Bench score of 7.64, which outperforms all 7B models on the LMSYS leaderboard, including Starling-LM-7B (7.48), OpenChat-3.5 (7.06) and Zephyr-7B-beta (6.53). Our code and data are available at https://github.com/CrossmodalGroup/KNN-Instruct/.

2023

pdf bib
Grammatical Error Correction via Mixed-Grained Weighted Training
Jiahao Li | Quan Wang | Chiwei Zhu | Zhendong Mao | Yongdong Zhang
Findings of the Association for Computational Linguistics: EMNLP 2023

The task of Grammatical Error Correction (GEC) aims to automatically correct grammatical errors in natural texts. Almost all previous works treat annotated training data equally, but inherent discrepancies in data are neglected. In this paper, the inherent discrepancies are manifested in two aspects, namely, accuracy of data annotation and diversity of potential annotations. To this end, we propose MainGEC, which designs token-level and sentence-level training weights based on inherent discrepancies therein, and then conducts mixed-grained weighted training to improve the training effect for GEC. Empirical evaluation shows that whether in the Seq2Seq or Seq2Edit manner, MainGEC achieves consistent and significant performance improvements on two benchmark datasets, demonstrating the effectiveness and superiority of the mixed-grained weighted training. Further ablation experiments verify the effectiveness of designed weights for both granularities in MainGEC.

pdf bib
On the Calibration of Large Language Models and Alignment
Chiwei Zhu | Benfeng Xu | Quan Wang | Yongdong Zhang | Zhendong Mao
Findings of the Association for Computational Linguistics: EMNLP 2023

As large language models attract increasing attention and find widespread application, concurrent challenges of reliability also arise at the same time. Confidence calibration, an effective analysis method for gauging the reliability of deep models, serves as a crucial tool for assessing and improving their reliability. However, such investigation has been comparatively underexplored. In this work, we conduct a systematic examination of the calibration of aligned language models throughout the entire construction process, including pretraining and alignment training. At each stage, we investigate how different training settings, such as parameter scales and training data, affect model calibration. To thoroughly assess model calibration, we evaluate models on three most concerned aspects: generation, factuality and understanding. Our work sheds light on whether popular LLMs are well-calibrated and how the training process influences model calibration.