2025
Diagnosing Failures in Large Language Models’ Answers: Integrating Error Attribution into Evaluation Framework
Zishan Xu | Shuyi Xie | Qingsong Lv | Shupei Xiao | Linlin Song | Sui Wenjuan | Fan Lin
Findings of the Association for Computational Linguistics: ACL 2025
With the widespread application of Large Language Models (LLMs) across tasks, mainstream LLM platforms generate massive numbers of user-model interactions daily. To efficiently analyze model performance and diagnose failures in model answers, an automated framework that systematically categorizes and attributes errors is essential. However, existing evaluation models lack error attribution capability. In this work, we establish a comprehensive Misattribution Framework with 6 primary and 15 secondary categories to facilitate in-depth analysis. Based on this framework, we present AttriData, a dataset specifically designed for error attribution, in which each sample is annotated with a misattribution label along with the corresponding score and feedback. We also propose MisAttributionLLM, a model fine-tuned on AttriData, which is the first general-purpose judge model capable of simultaneously generating a score, a misattribution label, and feedback. Extensive experiments and analyses confirm the effectiveness and robustness of our proposed method.
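To make the combined judge output concrete, here is a minimal sketch of how a record pairing a score, a misattribution label, and feedback might be represented and parsed. The category names and the line-oriented output convention are illustrative assumptions, not the paper's actual AttriData schema or taxonomy.

```python
# A minimal sketch of a judge-output record; labels are hypothetical
# stand-ins for the paper's 6 primary categories, not its real taxonomy.
from dataclasses import dataclass

PRIMARY_CATEGORIES = {
    "factual_error", "instruction_violation", "reasoning_error",
    "formatting_error", "refusal_error", "other",
}

@dataclass
class JudgeOutput:
    score: int          # scalar quality score, e.g. 1-5
    misattribution: str # primary error category
    feedback: str       # natural-language rationale

def parse_judge_output(text: str) -> JudgeOutput:
    """Parse a 'score: N / category: C / feedback: ...' style response."""
    fields = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        fields[key.strip().lower()] = value.strip()
    category = fields.get("category", "other")
    if category not in PRIMARY_CATEGORIES:
        category = "other"  # fall back rather than emit an unknown label
    return JudgeOutput(
        score=int(fields.get("score", "0")),
        misattribution=category,
        feedback=fields.get("feedback", ""),
    )

print(parse_judge_output("score: 2\ncategory: reasoning_error\nfeedback: skipped a step"))
```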
RAISE: Reinforced Adaptive Instruction Selection For Large Language Models
Qingsong Lv | Yangning Li | Zihua Lan | Zishan Xu | Jiwei Tang | Tingwei Lu | Yinghui Li | Wenhao Jiang | Hong-Gee Kim | Hai-Tao Zheng | Philip S. Yu
Findings of the Association for Computational Linguistics: EMNLP 2025
Instruction tuning of large language models (LLMs) benefits more from a handful of high-quality examples than from hordes of low-quality ones. Existing selection methods typically rely on static, heuristic quality scores and are executed only once before training. Consequently, they neither adapt to the changing state of the model nor target downstream objectives, leaving substantial room for optimization. We propose RAISE (**R**einforced **A**daptive **I**nstruction **SE**lection), a *dynamic*, *task-driven* framework that integrates selection into every training step. At each step, RAISE estimates the expected contribution of each candidate instruction to task performance and admits only the most helpful. By modeling this process as sequential decision making, we optimize the selector with reinforcement learning, yielding an interpretable policy specialized for the target task. Extensive experiments show that RAISE achieves results comparable to or better than full-data training while using only 1% of the training steps, demonstrating both high efficacy and significant computational savings.
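As a rough illustration of step-wise, reward-driven selection, the toy sketch below scores candidate instructions with a linear policy and updates it with a REINFORCE-style rule. The features, the reward signal, and the linear scorer are placeholder assumptions, not RAISE's actual design.

```python
# Toy step-wise instruction selection with a REINFORCE-style update.
# Features and reward are synthetic stand-ins for real instruction
# statistics and measured downstream-task improvement.
import numpy as np

rng = np.random.default_rng(0)
n_candidates, n_features, k = 100, 8, 4
features = rng.normal(size=(n_candidates, n_features))  # per-instruction features
theta = np.zeros(n_features)                            # selector parameters
lr = 0.1

def task_reward(chosen: np.ndarray) -> float:
    # Stand-in for the change in task performance after training on
    # the chosen instructions; here just a fixed linear signal.
    return float(features[chosen, 0].mean())

for step in range(200):
    logits = features @ theta
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    chosen = rng.choice(n_candidates, size=k, replace=False, p=probs)
    r = task_reward(chosen)
    # REINFORCE-style: raise the log-probability of selected
    # instructions in proportion to the observed reward.
    grad = (features[chosen] - probs @ features).sum(axis=0)
    theta += lr * r * grad

print("learned selector weights:", np.round(theta, 2))
```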
FaStFact: Faster, Stronger Long-Form Factuality Evaluations in LLMs
Yingjia Wan | Haochen Tan | Xiao Zhu | Xinyu Zhou | Zhiwei Li | Qingsong Lv | Changxuan Sun | Jiaqi Zeng | Yi Xu | Jianqiao Lu | Yinhong Liu | Zhijiang Guo
Findings of the Association for Computational Linguistics: EMNLP 2025
Evaluating the factuality of long-form generations from Large Language Models (LLMs) remains challenging due to accuracy issues and costly human assessment. Prior evaluation pipelines attempt this by decomposing text into claims, searching for evidence, and verifying each claim, but suffer from two critical drawbacks: (1) inefficiency, because complex pipeline components are ill-suited to long LLM outputs, and (2) ineffectiveness, stemming from inaccurate claim sets and insufficient evidence collected from one-line SERP snippets. To address these limitations, we adapt the existing decompose-then-verify evaluation framework and propose **FaStFact**, a fast and strong evaluation pipeline that achieves the highest alignment with human evaluation and the best efficiency among existing baselines. FaStFact first employs chunk-level claim extraction integrated with confidence-based pre-verification, significantly reducing the cost of web searching and inference calls while ensuring reliability. For searching and verification, it gathers document-level evidence from crawled website pages and retrieves from it during verification, addressing the evidence-insufficiency problem of previous pipelines. Extensive experiments on an aggregated and manually annotated benchmark demonstrate that FaStFact evaluates the factuality of long-form LLM generations both efficiently and effectively. We submit the paper with code and benchmark, and will make them publicly available to facilitate research.
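The pipeline shape described above (chunking, claim extraction, confidence-based pre-verification, document-level evidence) might be skeletonized as follows. Every function body is a stub standing in for an LLM, crawler, or retriever call; none of the thresholds, prompts, or heuristics are FaStFact's.

```python
# Skeleton of a decompose-then-verify factuality pipeline with
# chunk-level extraction and confidence-based pre-verification.
# All stages are illustrative stubs, not the paper's implementation.
from dataclasses import dataclass, field

@dataclass
class Claim:
    text: str
    confidence: float                   # extractor's prior confidence
    evidence: list[str] = field(default_factory=list)

def chunk(text: str, size: int = 400) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

def extract_claims(chunk_text: str) -> list[Claim]:
    # Placeholder: one "claim" per sentence, with a dummy confidence.
    return [Claim(s.strip(), confidence=0.5)
            for s in chunk_text.split(".") if s.strip()]

def retrieve_documents(query: str) -> list[str]:
    # Stub: crawl and cache full pages rather than one-line SERP snippets.
    return []

def judge_with_evidence(claim: Claim) -> bool:
    return bool(claim.evidence)  # stub verdict

def verify(claim: Claim, threshold: float = 0.9) -> bool:
    # Pre-verification: skip web search when the extractor is already
    # confident; otherwise gather document-level evidence and judge.
    if claim.confidence >= threshold:
        return True
    claim.evidence = retrieve_documents(claim.text)
    return judge_with_evidence(claim)

generation = "Mount Everest is in Nepal. It is 9000 km tall."
claims = [c for ch in chunk(generation) for c in extract_claims(ch)]
print([(c.text, verify(c)) for c in claims])
```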
2022
Parameter-Efficient Tuning Makes a Good Classification Head
Zhuoyi Yang | Ming Ding | Yanhui Guo | Qingsong Lv | Jie Tang
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
In recent years, pretrained models have revolutionized the paradigm of natural language understanding (NLU): we append a randomly initialized classification head to a pretrained backbone, e.g. BERT, and finetune the whole model. Since the pretrained backbone contributes most of the improvement, we naturally expect that a good pretrained classification head could also benefit training. However, the final-layer output of the backbone, i.e. the input to the classification head, changes greatly during finetuning, which makes the usual head-only pretraining ineffective. In this paper, we find that parameter-efficient tuning makes a good classification head, with which we can simply replace the randomly initialized head for a stable performance gain. Our experiments demonstrate that a classification head jointly pretrained with parameter-efficient tuning consistently improves performance on 9 tasks in GLUE and SuperGLUE.
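A minimal PyTorch sketch of the head-replacement recipe appears below, assuming a BitFit-style parameter-efficient stage (training only biases plus the head) and a toy linear "backbone". The paper's actual PEFT method and training loop are not reproduced here.

```python
# Sketch: pretrain a head under parameter-efficient tuning, then reuse
# it in place of a randomly initialized head for full finetuning.
import torch.nn as nn

in_dim, hidden, n_classes = 128, 768, 2
backbone = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh())  # stand-in encoder

# Stage 1: parameter-efficient pretraining of the head. Freeze all
# backbone weights except biases (a BitFit-style assumption) and train
# the head jointly with them.
for name, p in backbone.named_parameters():
    p.requires_grad = name.endswith("bias")
pretrained_head = nn.Linear(hidden, n_classes)
# ... train backbone biases + pretrained_head on the task here ...

# Stage 2: full finetuning, but initialize the head from stage 1 instead
# of randomly. (In practice the backbone would be restored to its
# original pretrained weights first.)
for p in backbone.parameters():
    p.requires_grad = True
head = nn.Linear(hidden, n_classes)
head.load_state_dict(pretrained_head.state_dict())
model = nn.Sequential(backbone, head)

print(sum(p.numel() for p in model.parameters() if p.requires_grad))
```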
2018
Cross-lingual Knowledge Graph Alignment via Graph Convolutional Networks
Zhichun Wang | Qingsong Lv | Xiaohan Lan | Yu Zhang
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
Multilingual knowledge graphs (KGs) such as DBpedia and YAGO contain structured knowledge about entities in several distinct languages, and they are useful resources for cross-lingual AI and NLP applications. Cross-lingual KG alignment is the task of matching entities with their counterparts in different languages, and it is an important way to enrich the cross-lingual links in multilingual KGs. In this paper, we propose a novel approach to cross-lingual KG alignment via graph convolutional networks (GCNs). Given a set of pre-aligned entities, our approach trains GCNs to embed the entities of each language into a unified vector space; entity alignments are then discovered based on the distances between entities in that space. Embeddings can be learned from both the structural and attribute information of entities, and the results of structure embedding and attribute embedding are combined to obtain accurate alignments. In experiments on aligning real multilingual KGs, our approach achieves the best performance among embedding-based KG alignment approaches.
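The core embed-and-match step might look like the NumPy toy below: a shared GCN layer embeds both KGs, and entities are aligned by nearest-neighbor distance. The random graphs, the single untrained layer, and the L1 metric are illustrative simplifications; the paper trains the GCNs on pre-aligned seed entities and combines structure and attribute embeddings.

```python
# Toy alignment-by-distance: embed two KGs with one shared GCN layer,
# then match each KG1 entity to its nearest KG2 entity.
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 16
A1 = rng.integers(0, 2, size=(n, n)).astype(float)  # KG1 adjacency
A2 = A1.copy()                                      # pretend-isomorphic KG2
X1, X2 = rng.normal(size=(n, d)), rng.normal(size=(n, d))
W = rng.normal(size=(d, d)) * 0.1                   # shared GCN weights

def gcn_layer(A, X, W):
    A_hat = A + np.eye(len(A))                      # add self-loops
    D_inv = np.diag(1.0 / A_hat.sum(axis=1))
    return np.maximum(D_inv @ A_hat @ X @ W, 0.0)   # ReLU(D^-1 A_hat X W)

H1, H2 = gcn_layer(A1, X1, W), gcn_layer(A2, X2, W)

# Pairwise L1 distances in the unified embedding space.
dists = np.abs(H1[:, None, :] - H2[None, :, :]).sum(-1)
print("alignment:", dists.argmin(axis=1))
```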