Huan Liu
Other people with similar names: Huan Liu
Unverified author pages with similar names: Huan Liu
2026
ToolPRMBench: Evaluating and Advancing Process Reward Models for Tool-using Agents
Dawei Li | Yuguang Yao | Zhen Tan | Huan Liu | Ruocheng Guo
Findings of the Association for Computational Linguistics: ACL 2026
Dawei Li | Yuguang Yao | Zhen Tan | Huan Liu | Ruocheng Guo
Findings of the Association for Computational Linguistics: ACL 2026
Reward-guided search methods have demonstrated strong potential in enhancing tool-using agents by effectively guiding sampling and exploration over complex action spaces. As a core design, those search methods utilize process reward models (PRMs) to provide step-level rewards, enabling more fine-grained monitoring. However, there is a lack of systematic and reliable evaluation benchmarks for PRMs in tool-use settings. In this paper, we introduce ToolPRMBench, a large-scale benchmark specifically designed to evaluate PRMs for tool-using agents. ToolPRMBench is built on top of several representative tool-use benchmarks and converts agent trajectories into step-level test cases. Each case contains the interaction history, a correct action, a plausible but incorrect alternative, and relevant tool metadata. We respectively utilize offline sampling to isolate local single-step errors and online sampling to capture realistic multi-step failures from full agent rollouts. A multi-LLM verification pipeline is proposed to reduce label noise and ensure data quality. We conduct extensive experiments across large language models, general PRMs, and tool-specialized PRMs on ToolPRMBench. The results reveal clear differences in PRM effectiveness and highlight the potential of specialized PRMs for tool-using. Our code and dataset are available at: https://github.com/David-Li0406/ToolPRMBench[More resources on LLM-as-a-judge are on the website: <https://llm-as-a-judge.github.io>].
Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
Chengshuai Zhao | Zhen Tan | Pingchuan Ma | Dawei Li | Bohan Jiang | Yancheng Wang | Yingzhen Yang | Huan Liu
Findings of the Association for Computational Linguistics: ACL 2026
Chengshuai Zhao | Zhen Tan | Pingchuan Ma | Dawei Li | Bohan Jiang | Yancheng Wang | Yingzhen Yang | Huan Liu
Findings of the Association for Computational Linguistics: ACL 2026
Chain-of-Thought (CoT) prompting has been shown to be effective in eliciting structured reasoning (i.e., CoT reasoning) from large language models (LLMs). Regardless of its popularity, recent studies expose its failures in some reasoning tasks, raising fundamental questions about the nature of CoT reasoning. In this work, we propose a data distribution lens to understand when and why CoT reasoning succeeds or fails. We hypothesize that CoT reasoning reflects a structured inductive bias learned from in-distribution data, enabling models to conditionally generate reasoning trajectories that approximate those observed during training. As such, the effectiveness of CoT reasoning is fundamentally governed by the nature and degree of distribution discrepancy between training data and test queries. Guided by this lens, we dissect CoT reasoning via three dimensions: task, length, and format. To test the hypothesis, we introduce DataAlchemy, an abstract and fully controllable environment that trains LLMs from scratch and systematically probes them under various distribution conditions. Through rigorous controlled experiments, we reveal that CoT reasoning is a brittle mirage when it is pushed beyond training distributions, emphasizing the ongoing challenge of achieving genuine and generalizable reasoning.
2025
In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents
Zhen Tan | Jun Yan | I-Hung Hsu | Rujun Han | Zifeng Wang | Long Le | Yiwen Song | Yanfei Chen | Hamid Palangi | George Lee | Anand Rajan Iyer | Tianlong Chen | Huan Liu | Chen-Yu Lee | Tomas Pfister
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Zhen Tan | Jun Yan | I-Hung Hsu | Rujun Han | Zifeng Wang | Long Le | Yiwen Song | Yanfei Chen | Hamid Palangi | George Lee | Anand Rajan Iyer | Tianlong Chen | Huan Liu | Chen-Yu Lee | Tomas Pfister
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Models (LLMs) have made significant progress in open-ended dialogue, yet their inability to retain and retrieve relevant information from long-term interactions limits their effectiveness in applications requiring sustained personalization. External memory mechanisms have been proposed to address this limitation, enabling LLMs to maintain conversational continuity. However, existing approaches struggle with two key challenges. First, rigid memory granularity fails to capture the natural semantic structure of conversations, leading to fragmented and incomplete representations. Second, fixed retrieval mechanisms cannot adapt to diverse dialogue contexts and user interaction patterns. In this work, we propose Reflective Memory Management (RMM), a novel mechanism for long-term dialogue agents, integrating forward- and backward-looking reflections: (1) Prospective Reflection, which dynamically summarizes interactions across granularities—utterances, turns, and sessions—into a personalized memory bank for effective future retrieval, and (2) Retrospective Reflection, which iteratively refines the retrieval in an online reinforcement learning (RL) manner based on LLMs’ cited evidence. Experiments show that RMM demonstrates consistent improvement across various metrics and benchmarks. For example, RMM shows more than 10% accuracy improvement over the baseline without memory management on the LongMemEval dataset.
From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
Dawei Li | Bohan Jiang | Liangjie Huang | Alimohammad Beigi | Chengshuai Zhao | Zhen Tan | Amrita Bhattacharjee | Yuxuan Jiang | Canyu Chen | Tianhao Wu | Kai Shu | Lu Cheng | Huan Liu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Dawei Li | Bohan Jiang | Liangjie Huang | Alimohammad Beigi | Chengshuai Zhao | Zhen Tan | Amrita Bhattacharjee | Yuxuan Jiang | Canyu Chen | Tianhao Wu | Kai Shu | Lu Cheng | Huan Liu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Assessment and evaluation have long been critical challenges in artificial intelligence (AI) and natural language processing (NLP). Traditional methods, usually matching-based or small model-based, often fall short in open-ended and dynamic scenarios. Recent advancements in Large Language Models (LLMs) inspire the “LLM-as-a-judge” paradigm, where LLMs are leveraged to perform scoring, ranking, or selection for various machine learning evaluation scenarios. This paper presents a comprehensive survey of LLM-based judgment and assessment, offering an in-depth overview to review this evolving field. We first provide the definition from both input and output perspectives. Then we introduce a systematic taxonomy to explore LLM-as-a-judge along three dimensions: what to judge, how to judge, and how to benchmark. Finally, we also highlight key challenges and promising future directions for this emerging area.
SCALE: Towards Collaborative Content Analysis in Social Science with Large Language Model Agents and Human Intervention
Chengshuai Zhao | Zhen Tan | Chau-Wai Wong | Xinyan Zhao | Tianlong Chen | Huan Liu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Chengshuai Zhao | Zhen Tan | Chau-Wai Wong | Xinyan Zhao | Tianlong Chen | Huan Liu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Content analysis breaks down complex and unstructured texts into theory-informed numerical categories. Particularly, in social science, this process usually relies on multiple rounds of manual annotation, domain expert discussion, and rule-based refinement. In this paper, we introduce SCALE, a novel multi-agent framework that effectively ̲Simulates ̲Content ̲Analysis via ̲Large language model (LLM) ag ̲Ents. SCALE imitates key phases of content analysis, including text coding, collaborative discussion, and dynamic codebook evolution, capturing the reflective depth and adaptive discussions of human researchers. Furthermore, by integrating diverse modes of human intervention, SCALE is augmented with expert input to further enhance its performance. Extensive evaluations on real-world datasets demonstrate that SCALE achieves human-approximated performance across various complex content analysis tasks, offering an innovative potential for future social science research.
2024
Large Language Models for Data Annotation and Synthesis: A Survey
Zhen Tan | Dawei Li | Song Wang | Alimohammad Beigi | Bohan Jiang | Amrita Bhattacharjee | Mansooreh Karami | Jundong Li | Lu Cheng | Huan Liu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Zhen Tan | Dawei Li | Song Wang | Alimohammad Beigi | Bohan Jiang | Amrita Bhattacharjee | Mansooreh Karami | Jundong Li | Lu Cheng | Huan Liu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Data annotation and synthesis generally refers to the labeling or generating of raw data with relevant information, which could be used for improving the efficacy of machine learning models. The process, however, is labor-intensive and costly. The emergence of advanced Large Language Models (LLMs), exemplified by GPT-4, presents an unprecedented opportunity to automate the complicated process of data annotation and synthesis. While existing surveys have extensively covered LLM architecture, training, and general applications, we uniquely focus on their specific utility for data annotation. This survey contributes to three core aspects: LLM-Based Annotation Generation, LLM-Generated Annotations Assessment, and LLM-Generated Annotations Utilization. Furthermore, this survey includes an in-depth taxonomy of data types that LLMs can annotate, a comprehensive review of learning strategies for models utilizing LLM-generated annotations, and a detailed discussion of the primary challenges and limitations associated with using LLMs for data annotation and synthesis. Serving as a key guide, this survey aims to assist researchers and practitioners in exploring the potential of the latest LLMs for data annotation, thereby fostering future advancements in this critical field.
JORA: JAX Tensor-Parallel LoRA Library for Retrieval Augmented Fine-Tuning
Anique Tahir | Lu Cheng | Huan Liu
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Anique Tahir | Lu Cheng | Huan Liu
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
The scaling of Large Language Models (LLMs) for retrieval-based tasks, particularly in Retrieval Augmented Generation (RAG), faces significant memory constraints, especially when fine-tuning extensive prompt sequences. Current open-source libraries support full-model inference and fine-tuning across multiple GPUs but fall short of accommodating the efficient parameter distribution required for retrieved context. Addressing this gap, we introduce a novel framework for PEFT-compatible fine-tuning of GPT models, leveraging distributed training. Our framework uniquely utilizes JAX’s just-in-time (JIT) compilation and tensor-sharding for efficient resource management, thereby enabling accelerated fine-tuning with reduced memory requirements. This advancement significantly improves the scalability and feasibility of fine-tuning LLMs for complex RAG applications, even on systems with limited GPU resources. Our experiments show more than 12x improvement in runtime compared to Hugging Face/DeepSpeed implementation with four GPUs while consuming less than half the VRAM per GPU.
DALK: Dynamic Co-Augmentation of LLMs and KG to answer Alzheimer’s Disease Questions with Scientific Literature
Dawei Li | Shu Yang | Zhen Tan | Jae Young Baik | Sukwon Yun | Joseph Lee | Aaron Chacko | Bojian Hou | Duy Duong-Tran | Ying Ding | Huan Liu | Li Shen | Tianlong Chen
Findings of the Association for Computational Linguistics: EMNLP 2024
Dawei Li | Shu Yang | Zhen Tan | Jae Young Baik | Sukwon Yun | Joseph Lee | Aaron Chacko | Bojian Hou | Duy Duong-Tran | Ying Ding | Huan Liu | Li Shen | Tianlong Chen
Findings of the Association for Computational Linguistics: EMNLP 2024
Recent advancements in large language models (LLMs) have achieved promising performances across various applications. Nonetheless, the ongoing challenge of integrating long-tail knowledge continues to impede the seamless adoption of LLMs in specialized domains. In this work, we introduce DALK, a.k.a. Dynamic Co-Augmentation of LLMs and KG, to address this limitation and demonstrate its ability on studying Alzheimer’s Disease (AD), a specialized sub-field in biomedicine and a global health priority. With a synergized framework of LLM and KG mutually enhancing each other, we first leverage LLM to construct an evolving AD-specific knowledge graph (KG) sourced from AD-related scientific literature, and then we utilize a coarse-to-fine sampling method with a novel self-aware knowledge retrieval approach to select appropriate knowledge from the KG to augment LLM inference capabilities. The experimental results, conducted on our constructed AD question answering (ADQA) benchmark, underscore the efficacy of DALK. Additionally, we perform a series of detailed analyses that can offer valuable insights and guidelines for the emerging topic of mutually enhancing KG and LLM.
Contextualization Distillation from Large Language Model for Knowledge Graph Completion
Dawei Li | Zhen Tan | Tianlong Chen | Huan Liu
Findings of the Association for Computational Linguistics: EACL 2024
Dawei Li | Zhen Tan | Tianlong Chen | Huan Liu
Findings of the Association for Computational Linguistics: EACL 2024
While textual information significantly enhances the performance of pre-trained language models (PLMs) in knowledge graph completion (KGC), the static and noisy nature of existing corpora collected from Wikipedia articles or synsets definitions often limits the potential of PLM-based KGC models. To surmount these challenges, we introduce the Contextualization Distillation strategy, a versatile plug-in-and-play approach compatible with both discriminative and generative KGC frameworks. Our method begins by instructing large language models (LLMs) to transform compact, structural triplets into context-rich segments. Subsequently, we introduce two tailored auxiliary tasks—reconstruction and contextualization—allowing smaller KGC models to assimilate insights from these enriched triplets. Comprehensive evaluations across diverse datasets and KGC techniques highlight the efficacy and adaptability of our approach, revealing consistent performance enhancements irrespective of underlying pipelines or architectures. Moreover, our analysis makes our method more explainable and provides insight into how to generate high-quality corpora for KGC, as well as the selection of suitable distillation tasks.
Defending Against Social Engineering Attacks in the Age of LLMs
Lin Ai | Tharindu Sandaruwan Kumarage | Amrita Bhattacharjee | Zizhou Liu | Zheng Hui | Michael S. Davinroy | James Cook | Laura Cassani | Kirill Trapeznikov | Matthias Kirchner | Arslan Basharat | Anthony Hoogs | Joshua Garland | Huan Liu | Julia Hirschberg
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Lin Ai | Tharindu Sandaruwan Kumarage | Amrita Bhattacharjee | Zizhou Liu | Zheng Hui | Michael S. Davinroy | James Cook | Laura Cassani | Kirill Trapeznikov | Matthias Kirchner | Arslan Basharat | Anthony Hoogs | Joshua Garland | Huan Liu | Julia Hirschberg
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Glue pizza and eat rocks - Exploiting Vulnerabilities in Retrieval-Augmented Generative Models
Zhen Tan | Chengshuai Zhao | Raha Moraffah | Yifan Li | Song Wang | Jundong Li | Tianlong Chen | Huan Liu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Zhen Tan | Chengshuai Zhao | Raha Moraffah | Yifan Li | Song Wang | Jundong Li | Tianlong Chen | Huan Liu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Retrieval-Augmented Generative (RAG) models enhance Large Language Models (LLMs) by integrating external knowledge bases, improving their performance in applications like fact-checking and information searching. In this paper, we demonstrate a security threat where adversaries can exploit the openness of these knowledge bases by injecting deceptive content into the retrieval database, intentionally changing the model’s behavior. This threat is critical as it mirrors real-world usage scenarios where RAG systems interact with publicly accessible knowledge bases, such as web scrapings and user-contributed data pools. To be more realistic, we target a realistic setting where the adversary has no knowledge of users’ queries, knowledge base data, and the LLM parameters. We demonstrate that it is possible to exploit the model successfully through crafted content uploads with access to the retriever. Our findings emphasize an urgent need for security measures in the design and deployment of RAG systems to prevent potential manipulation and ensure the integrity of machine-generated content.
Exploiting Class Probabilities for Black-box Sentence-level Attacks
Raha Moraffah | Huan Liu
Findings of the Association for Computational Linguistics: EACL 2024
Raha Moraffah | Huan Liu
Findings of the Association for Computational Linguistics: EACL 2024
Sentence-level attacks craft adversarial sentences that are synonymous with correctly-classified sentences but are misclassified by the text classifiers. Under the black-box setting, classifiers are only accessible through their feedback to queried inputs, which is predominately available in the form of class probabilities. Even though utilizing class probabilities results in stronger attacks, due to the challenges of using them for sentence-level attacks, existing attacks use either no feedback or only the class labels. Overcoming the challenges, we develop a novel algorithm that uses class probabilities for black-box sentence-level attacks, investigate the effectiveness of using class probabilities on the attack’s success, and examine the question if it is worthy or practical to use class probabilities by black-box sentence-level attacks. We conduct extensive evaluations of the proposed attack comparing with the baselines across various classifiers and benchmark datasets.
2023
Towards Detecting Harmful Agendas in News Articles
Melanie Subbiah | Amrita Bhattacharjee | Yilun Hua | Tharindu Kumarage | Huan Liu | Kathleen McKeown
Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis
Melanie Subbiah | Amrita Bhattacharjee | Yilun Hua | Tharindu Kumarage | Huan Liu | Kathleen McKeown
Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis
Manipulated news online is a growing problem which necessitates the use of automated systems to curtail its spread. We argue that while misinformation and disinformation detection have been studied, there has been a lack of investment in the important open challenge of detecting harmful agendas in news articles; identifying harmful agendas is critical to flag news campaigns with the greatest potential for real world harm. Moreover, due to real concerns around censorship, harmful agenda detectors must be interpretable to be effective. In this work, we propose this new task and release a dataset, NewsAgendas, of annotated news articles for agenda identification. We show how interpretable systems can be effective on this task and demonstrate that they can perform comparably to black-box models.
Search
Fix author
Co-authors
- Zhen Tan 9
- Dawei Li 6
- Tianlong Chen 5
- Amrita Bhattacharjee 4
- Chengshuai Zhao 4
- Lu Cheng 3
- Bohan Jiang 3
- Alimohammad Beigi 2
- Jundong Li 2
- Raha Moraffah 2
- Song Wang 2
- Lin Ai 1
- Jae Young Baik 1
- Arslan Basharat 1
- Laura Cassani 1
- Aaron Chacko 1
- Yanfei Chen 1
- Canyu Chen 1
- James Cook 1
- Michael S. Davinroy 1
- Ying Ding 1
- Duy Duong-Tran 1
- Joshua Garland 1
- Ruocheng Guo 1
- Rujun Han 1
- Julia Hirschberg 1
- Anthony Hoogs 1
- Bojian Hou 1
- I-Hung Hsu 1
- Yilun Hua 1
- Liangjie Huang 1
- Zheng Hui 1
- Anand Rajan Iyer 1
- Yuxuan Jiang 1
- Mansooreh Karami 1
- Matthias Kirchner 1
- Tharindu Kumarage 1
- Tharindu Sandaruwan Kumarage 1
- Long Le 1
- Joseph Lee 1
- George Lee 1
- Chen-Yu Lee 1
- Yifan Li 1
- Zizhou Liu 1
- Pingchuan Ma 1
- Kathleen McKeown 1
- Hamid Palangi 1
- Tomas Pfister 1
- Li Shen 1
- Kai Shu 1
- Yiwen Song 1
- Melanie Subbiah 1
- Anique Tahir 1
- Kirill Trapeznikov 1
- Yancheng Wang 1
- Zifeng Wang 1
- Chau-Wai Wong 1
- Tianhao Wu 1
- Jun Yan 1
- Shu Yang 1
- Yingzhen Yang 1
- Yuguang Yao 1
- Sukwon Yun 1
- Xinyan Zhao 1