2025
Learning with Less: Knowledge Distillation from Large Language Models via Unlabeled Data
Juanhui Li | Sreyashi Nag | Hui Liu | Xianfeng Tang | Sheikh Muhammad Sarwar | Limeng Cui | Hansu Gu | Suhang Wang | Qi He | Jiliang Tang
Findings of the Association for Computational Linguistics: NAACL 2025
In real-world NLP applications, Large Language Models (LLMs) offer promising solutions due to their extensive training on vast datasets. However, the large size and high computational demands of LLMs limit their practicality in many applications, especially when further fine-tuning is required. To address these limitations, smaller models are typically preferred for deployment, but their training is hindered by the scarcity of labeled data. In contrast, unlabeled data is often readily available and can be leveraged by using LLMs to generate pseudo-labels for training smaller models. This enables the smaller models (student) to acquire knowledge from LLMs (teacher) while reducing computational costs. The process introduces challenges, however, such as potentially noisy pseudo-labels. Selecting high-quality and informative data is therefore critical to enhance model performance while improving the efficiency of data utilization. To address this, we propose LLKD, which enables Learning with Less computational resources and less data for Knowledge Distillation from LLMs. LLKD is an adaptive sample selection method that incorporates signals from both the teacher and student. Specifically, it prioritizes samples where the teacher demonstrates high confidence in its labeling, indicating reliable labels, and where the student exhibits a high information need, identifying challenging samples that require further learning. Our comprehensive experiments show that LLKD achieves superior performance across various datasets with higher data efficiency.
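The selection signal the abstract describes, prioritizing samples where the teacher is confident and the student is uncertain, can be sketched as follows. The per-sample label distributions, field names, and the product scoring rule are illustrative assumptions, not LLKD's actual formulation.

```python
import math

def select_samples(samples, top_k):
    """Rank unlabeled samples by combining teacher confidence
    (reliable pseudo-labels) with student information need
    (challenging samples), keeping the top_k highest-scoring ones."""

    def entropy(probs):
        # Shannon entropy of a label distribution; high entropy
        # means the student is uncertain and needs this sample.
        return -sum(p * math.log(p) for p in probs if p > 0)

    def score(sample):
        teacher_conf = max(sample["teacher_probs"])      # high => reliable label
        student_need = entropy(sample["student_probs"])  # high => uncertain student
        return teacher_conf * student_need

    return sorted(samples, key=score, reverse=True)[:top_k]
```

A sample the teacher labels confidently but the student already predicts correctly scores low, since the student has little left to learn from it.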
IHEval: Evaluating Language Models on Following the Instruction Hierarchy
Zhihan Zhang | Shiyang Li | Zixuan Zhang | Xin Liu | Haoming Jiang | Xianfeng Tang | Yifan Gao | Zheng Li | Haodong Wang | Zhaoxuan Tan | Yichuan Li | Qingyu Yin | Bing Yin | Meng Jiang
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
The instruction hierarchy, which establishes a priority order from system messages to user messages, conversation history, and tool outputs, is essential for ensuring consistent and safe behavior in language models (LMs). Despite its importance, this topic receives limited attention, and there is a lack of comprehensive benchmarks for evaluating models’ ability to follow the instruction hierarchy. We bridge this gap by introducing IHEval, a novel benchmark comprising 3,538 examples across nine tasks, covering cases where instructions in different priorities either align or conflict. Our evaluation of popular LMs highlights their struggle to recognize instruction priorities. All evaluated models experience a sharp performance decline when facing conflicting instructions, compared to their original instruction-following performance. Moreover, the most competitive open-source model only achieves 48% accuracy in resolving such conflicts. Our results underscore the need for targeted optimization in the future development of LMs.
SimRAG: Self-Improving Retrieval-Augmented Generation for Adapting Large Language Models to Specialized Domains
Ran Xu | Hui Liu | Sreyashi Nag | Zhenwei Dai | Yaochen Xie | Xianfeng Tang | Chen Luo | Yang Li | Joyce C. Ho | Carl Yang | Qi He
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Retrieval-augmented generation (RAG) enhances the question answering (QA) abilities of large language models (LLMs) by integrating external knowledge. However, adapting general-purpose RAG systems to specialized fields such as science and medicine poses unique challenges due to distribution shifts and limited access to domain-specific data. To tackle this, we propose SimRAG, a self-training approach that equips LLMs with joint capabilities of question answering and question generation for domain adaptation. Our method first fine-tunes LLMs on instruction-following, question-answering, and search-related data. Then, it prompts LLMs to generate diverse domain-relevant questions from unlabeled corpora, with an additional filtering strategy to retain high-quality synthetic examples. By leveraging these synthetic examples, the LLMs can improve their performance on domain-specific RAG tasks. Experiments on 11 datasets across three different domains verify the efficacy of SimRAG over baselines by 1.2%–8.6%.
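The generate-then-filter loop described above can be sketched as below. The `llm` callable, the prompt wording, and the lexical-overlap filter are hypothetical stand-ins for the paper's question-generation prompts and filtering strategy.

```python
def lexical_overlap(answer, passage):
    """Fraction of answer tokens that also appear in the passage."""
    a, p = set(answer.lower().split()), set(passage.lower().split())
    return len(a & p) / max(len(a), 1)

def self_improve(llm, corpus, min_score=0.5):
    """SimRAG-style self-training sketch: prompt an LLM to generate a
    question from each unlabeled domain passage, answer it, and keep
    only pairs whose answer is grounded in the passage."""
    synthetic = []
    for passage in corpus:
        question = llm(f"Write one question answered by this passage: {passage}")
        answer = llm(f"Passage: {passage}\nQuestion: {question}\nAnswer:")
        # Filtering step: retain only high-quality synthetic examples.
        if lexical_overlap(answer, passage) >= min_score:
            synthetic.append({"question": question, "answer": answer,
                              "context": passage})
    return synthetic
```

The surviving (question, answer, context) triples would then serve as fine-tuning data for the domain-specific RAG task.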
2024
BlendFilter: Advancing Retrieval-Augmented Large Language Models via Query Generation Blending and Knowledge Filtering
Haoyu Wang | Ruirui Li | Haoming Jiang | Jinjin Tian | Zhengyang Wang | Chen Luo | Xianfeng Tang | Monica Xiao Cheng | Tuo Zhao | Jing Gao
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Retrieval-augmented Large Language Models (LLMs) offer substantial benefits in enhancing performance across knowledge-intensive scenarios. However, these methods often struggle with complex inputs and with noisy knowledge retrieval, which notably hinders model effectiveness. To address this issue, we introduce BlendFilter, a novel approach that elevates retrieval-augmented LLMs by integrating query generation blending with knowledge filtering. BlendFilter implements blending through its query generation method, which augments the original query with both external and internal knowledge, ensuring comprehensive information gathering. Additionally, our distinctive knowledge filtering module capitalizes on the intrinsic capabilities of the LLM to effectively eliminate extraneous data. We conduct extensive experiments on three open-domain question answering benchmarks, and the findings clearly indicate that our BlendFilter significantly surpasses state-of-the-art baselines.
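The two stages, query blending and knowledge filtering, can be sketched as follows. The `llm` and `retriever` callables and the prompt phrasing are assumptions for illustration, not the paper's interface.

```python
def blend_and_filter(llm, retriever, query):
    """BlendFilter-style sketch: blend the original query with internally
    and externally augmented variants, retrieve for each, then let the
    LLM filter out passages irrelevant to the original query."""
    internal = llm(f"Rewrite using your own knowledge: {query}")
    external = llm(f"Rewrite using retrieved facts: {query}")

    candidates = []
    for q in (query, internal, external):      # query generation blending
        candidates.extend(retriever(q))

    kept = []
    for passage in dict.fromkeys(candidates):  # dedupe, preserve order
        verdict = llm(f"Is this relevant to the question '{query}'? yes/no: {passage}")
        if verdict.strip().lower().startswith("yes"):  # knowledge filtering
            kept.append(passage)
    return kept
```

Only passages the LLM itself judges relevant survive to the final answer-generation step, which is how the filtering module leans on the model's intrinsic capabilities.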
Large Language Models Are Poor Clinical Decision-Makers: A Comprehensive Benchmark
Fenglin Liu | Zheng Li | Hongjian Zhou | Qingyu Yin | Jingfeng Yang | Xianfeng Tang | Chen Luo | Ming Zeng | Haoming Jiang | Yifan Gao | Priyanka Nigam | Sreyashi Nag | Bing Yin | Yining Hua | Xuan Zhou | Omid Rohanian | Anshul Thakur | Lei Clifton | David A. Clifton
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
The adoption of large language models (LLMs) to assist clinicians has attracted remarkable attention. Existing works mainly adopt the closed-ended question-answering (QA) task with answer options for evaluation. However, many clinical decisions involve answering open-ended questions without pre-set options. To better understand LLMs in the clinic, we construct a benchmark, ClinicBench. We first collect eleven existing datasets covering diverse clinical language generation, understanding, and reasoning tasks. Furthermore, we construct six novel datasets and clinical tasks that are complex but common in real-world practice, e.g., open-ended decision-making, long document processing, and emerging drug analysis. We conduct an extensive evaluation of twenty-two LLMs under both zero-shot and few-shot settings. Finally, we invite medical experts to evaluate the clinical usefulness of LLMs.
Sequential LLM Framework for Fashion Recommendation
Han Liu | Xianfeng Tang | Tianlang Chen | Jiapeng Liu | Indu Indu | Henry Peng Zou | Peng Dai | Roberto Fernandez Galan | Michael D Porter | Dongmei Jia | Ning Zhang | Lian Xiong
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
The fashion industry is one of the leading domains in the global e-commerce sector, prompting major online retailers to employ recommendation systems for product suggestions and customer convenience. While recommendation systems have been widely studied, most are designed for general e-commerce problems and struggle with the unique challenges of the fashion domain. To address these issues, we propose a sequential fashion recommendation framework that leverages a pre-trained large language model (LLM) enhanced with recommendation-specific prompts. Our framework employs parameter-efficient fine-tuning with extensive fashion data and introduces a novel mix-up-based retrieval technique for translating text into relevant product suggestions. Extensive experiments show our proposed framework significantly enhances fashion recommendation performance.
Graph Chain-of-Thought: Augmenting Large Language Models by Reasoning on Graphs
Bowen Jin | Chulin Xie | Jiawei Zhang | Kashob Kumar Roy | Yu Zhang | Zheng Li | Ruirui Li | Xianfeng Tang | Suhang Wang | Yu Meng | Jiawei Han
Findings of the Association for Computational Linguistics: ACL 2024
Large language models (LLMs), while exhibiting exceptional performance, suffer from hallucinations, especially on knowledge-intensive tasks. Existing works propose to augment LLMs with individual text units retrieved from external knowledge corpora to alleviate the issue. However, in many domains, texts are interconnected (e.g., academic papers in a bibliographic graph are linked by citations and co-authorships) which form a (text-attributed) graph. The knowledge in such graphs is encoded not only in single texts/nodes but also in their associated connections. To facilitate the research of augmenting LLMs with graphs, we manually construct a Graph Reasoning Benchmark dataset called GRBench, containing 1,740 questions that can be answered with the knowledge from 10 domain graphs. Then, we propose a simple and effective framework called Graph Chain-of-thought (Graph-CoT) to augment LLMs with graphs by encouraging LLMs to reason on the graph iteratively. Each Graph-CoT iteration consists of three sub-steps: LLM reasoning, LLM-graph interaction, and graph execution. We conduct systematic experiments with three LLM backbones on GRBench, where Graph-CoT outperforms the baselines consistently. The code is available at https://github.com/PeterGriffinJin/Graph-CoT/.
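The three sub-steps per iteration, LLM reasoning, LLM-graph interaction, and graph execution, can be sketched as a simple loop. The `Neighbors[...]`/`Finish[...]` action strings and the dict-based graph are illustrative assumptions, not the paper's exact interface.

```python
def graph_cot(llm, graph, question, max_iters=5):
    """Graph-CoT-style sketch: each iteration the LLM reasons over the
    context, proposes a graph interaction, and the executed result is
    appended to the context for the next iteration."""
    context = question
    for _ in range(max_iters):
        step = llm(context)                  # 1. LLM reasoning
        if step.startswith("Finish["):       # model decides it has the answer
            return step[len("Finish["):-1]
        if step.startswith("Neighbors["):    # 2. LLM-graph interaction
            node = step[len("Neighbors["):-1]
            result = graph.get(node, [])     # 3. graph execution
            context += f"\n{step} -> {result}"
    return None  # no answer within the iteration budget
```

Because the loop hands retrieved graph structure back to the model each round, the LLM can traverse multiple hops before committing to an answer.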