Jinjie Gu
2026
Perplexity-Aware Data Scaling Law: Perplexity Landscapes Predict Performance for Continual Pre-training
Lei Liu | Hao Zhu | Xiaoyan Yang | Yue Shen | Zhixuan Chu | Jian Wang | Jinjie Gu | Kui Ren
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Continual Pre-training (CPT) serves as a fundamental approach for adapting foundation models to domain-specific applications. Scaling laws for pre-training define a power-law relationship between dataset size and the test loss of an LLM. However, the marginal gains from simply increasing data for CPT diminish rapidly, yielding suboptimal data utilization and inefficient training. To address this challenge, we propose a novel perplexity-aware data scaling law to establish a predictive relationship between the perplexity landscape of domain-specific data and the test loss. Our approach leverages the pre-trained model’s own perplexity on domain data as a proxy for estimating the knowledge gap, effectively quantifying the informational perplexity landscape of candidate training samples. By fitting this scaling law across diverse perplexity regimes, we enable adaptive selection of high-utility data subsets, prioritizing content that maximizes knowledge absorption while minimizing redundancy and noise. Extensive experiments on both medical and general-domain benchmarks demonstrate that our method consistently identifies near-optimal training subsets, achieving superior performance with significantly reduced data consumption.
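As a rough illustration of the perplexity proxy described in this abstract, the sketch below computes per-sample perplexity from token log-probabilities and groups samples into perplexity bands. The function names, the band edges, and the input format are assumptions made for illustration only. The paper's actual scaling law and selection procedure are not reproduced here.

```python
import math
from bisect import bisect_right


def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood (natural log)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def bucket_by_perplexity(samples, edges):
    """Group sample ids into bands delimited by ascending `edges`.

    `samples` maps a sample id to its token log-probabilities
    (hypothetical input format). Band 0 holds ppl < edges[0], band i
    holds edges[i-1] <= ppl < edges[i], and the last band holds the rest.
    """
    buckets = {i: [] for i in range(len(edges) + 1)}
    for sample_id, logprobs in samples.items():
        band = bisect_right(edges, perplexity(logprobs))
        buckets[band].append(sample_id)
    return buckets
```

Under these assumptions, low bands would correspond to text the base model already predicts well and high bands to text that is unfamiliar or noisy. Which bands are worth training on is the question the abstract's scaling law addresses.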
WebClipper: Efficient Evolution of Web Agents with Graph-based Trajectory Pruning
Junjie Wang | Zequn Xie | Dan Yang | Jie Feng | Yue Shen | Duolin Sun | Meixiu Long | Yihan Jiao | Zhehao Tan | Jian Wang | Peng Wei | Jinjie Gu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Deep Research systems based on web agents have shown strong potential in solving complex information-seeking tasks, yet their search efficiency remains underexplored. We observe that many state-of-the-art open-source web agents rely on long tool-call trajectories with cyclic reasoning loops and exploration of unproductive branches. To address this, we propose WebClipper, a framework that compresses web agent trajectories via graph-based pruning. Concretely, we model the agent’s search process as a state graph and cast trajectory optimization as a minimum-necessary Directed Acyclic Graph (DAG) mining problem, yielding pruned trajectories that preserve essential reasoning while eliminating redundant steps. Continued training on these refined trajectories enables the agent to evolve toward more efficient search patterns and reduces tool-call rounds by about 20% while improving accuracy. Furthermore, we introduce a new metric called F-AE Score to measure the model’s overall performance in balancing accuracy and efficiency. Experiments demonstrate that WebClipper reduces tool-call rounds while maintaining strong performance, providing practical insight into balancing effectiveness and efficiency in web agent design.
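To make the idea of removing cyclic loops from a trajectory concrete, here is a minimal sketch that erases loops from a linear sequence of visited states. It is a deliberately simplified stand-in, not the minimum-necessary DAG mining described in the abstract. The function name and the string encoding of states are assumptions for illustration.

```python
def prune_cycles(states):
    """Drop every loop from a trajectory of hashable states.

    When a state reappears, the steps taken since its first visit are
    discarded, so each state occurs at most once in the result.
    """
    pruned = []
    position = {}  # state -> index of that state in `pruned`
    for state in states:
        if state in position:
            # Revisit: cut everything after the first visit to this state.
            cut = position[state] + 1
            for dropped in pruned[cut:]:
                del position[dropped]
            del pruned[cut:]
        else:
            position[state] = len(pruned)
            pruned.append(state)
    return pruned
```

For example, a trajectory that opens page A, issues another search, and then returns to page A would lose the intermediate search step. A graph-based method can additionally drop unproductive branches that never loop back, which this sketch does not attempt.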
HAD: HAllucination Detection Language Models Based on a Comprehensive Hallucination Taxonomy
Fan Xu | Xinyu Hu | Zhenghan Yu | Li Lin | Xu Zhang | Yang Zhang | Wei Zhou | Jinjie Gu | Xiaojun Wan
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
The increasing reliance on natural language generation (NLG) models, particularly large language models, has raised concerns about the reliability and accuracy of their outputs. A key challenge is hallucination, where models produce plausible but incorrect information. As a result, hallucination detection has become a critical task. In this work, we introduce a comprehensive hallucination taxonomy with 11 categories across various NLG tasks and propose the HAllucination Detection (HAD) models, which integrate hallucination detection, span-level identification, and correction into a single inference process. Trained on an elaborate synthetic dataset of about 90K samples, our HAD models are versatile and can be applied to various NLG tasks. We also carefully annotate a test set for hallucination detection, called HADTest, which contains 2,248 samples. Evaluations on in-domain and out-of-domain test sets show that our HAD models generally outperform the existing baselines, achieving state-of-the-art results on HaluEval, FactCHD, and FaithBench, confirming their robustness and versatility.
2025
CPRM: A LLM-based Continual Pre-training Framework for Relevance Modeling in Commercial Search
Kaixin Wu | Yixin Ji | Zeyuan Chen | Qiang Wang | Cunxiang Wang | Hong Liu | Baijun Ji | Xu Jia | Zhongyi Liu | Jinjie Gu | Yuan Zhou | Linjian Mo
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track)
Relevance modeling between queries and items stands as a pivotal component in commercial search engines, directly affecting the user experience. Given the remarkable achievements of large language models (LLMs) in various natural language processing (NLP) tasks, LLM-based relevance modeling is gradually being adopted within industrial search systems. Nevertheless, foundational LLMs lack domain-specific knowledge and do not fully exploit the potential of in-context learning. Furthermore, structured item text remains underutilized, and there is a shortage in the supply of corresponding queries and background knowledge. We thereby propose CPRM (Continual Pre-training for Relevance Modeling), a framework designed for the continual pre-training of LLMs to address these issues. Our CPRM framework includes three modules: 1) employing both queries and multi-field item to jointly pre-train for enhancing domain knowledge, 2) applying in-context pre-training, a novel approach where LLMs are pre-trained on a sequence of related queries or items, and 3) conducting reading comprehension on items to produce associated domain knowledge and background information (e.g., generating summaries and corresponding queries) to further strengthen LLMs. Results on offline experiments and online A/B testing demonstrate that our model achieves convincing performance compared to strong baselines.
KnowAgent: Knowledge-Augmented Planning for LLM-Based Agents
Yuqi Zhu | Shuofei Qiao | Yixin Ou | Shumin Deng | Shiwei Lyu | Yue Shen | Lei Liang | Jinjie Gu | Huajun Chen | Ningyu Zhang
Findings of the Association for Computational Linguistics: NAACL 2025
Large Language Models (LLMs) have demonstrated great potential in complex reasoning tasks, yet they fall short when tackling more sophisticated challenges, especially when interacting with environments through generating executable actions. This inadequacy primarily stems from the lack of built-in action knowledge in language agents, which fails to effectively guide the planning trajectories during task solving and results in planning hallucination. To address this issue, we introduce KnowAgent, a novel approach designed to enhance the planning capabilities of LLMs by incorporating explicit action knowledge. Specifically, KnowAgent employs an action knowledge base and a knowledgeable self-learning strategy to constrain the action path during planning, enabling more reasonable trajectory synthesis, and thereby enhancing the planning performance of language agents. Experimental results on HotpotQA and ALFWorld based on various backbone models demonstrate that KnowAgent can achieve comparable or superior performance to existing baselines. Further analysis indicates the effectiveness of KnowAgent in mitigating planning hallucinations.
2024
Learning to Plan for Retrieval-Augmented Large Language Models from Knowledge Graphs
Junjie Wang | Mingyang Chen | Binbin Hu | Dan Yang | Ziqi Liu | Yue Shen | Peng Wei | Zhiqiang Zhang | Jinjie Gu | Jun Zhou | Jeff Z. Pan | Wen Zhang | Huajun Chen
Findings of the Association for Computational Linguistics: EMNLP 2024
Improving the performance of large language models (LLMs) in complex question-answering (QA) scenarios has always been a research focal point. Recent studies have attempted to enhance LLMs’ performance by combining step-wise planning with external retrieval. While effective for advanced models like GPT-3.5, smaller LLMs face challenges in decomposing complex questions, necessitating supervised fine-tuning. Previous work has relied on manual annotation and knowledge distillation from teacher LLMs, which are time-consuming and not accurate enough. In this paper, we introduce a novel framework for enhancing LLMs’ planning capabilities by using planning data derived from knowledge graphs (KGs). LLMs fine-tuned with this data have improved planning capabilities, better equipping them to handle complex QA tasks that involve retrieval. Evaluations on multiple datasets, including our newly proposed benchmark, highlight the effectiveness of our framework and the benefits of KG-derived planning data.
Editing Conceptual Knowledge for Large Language Models
Xiaohan Wang | Shengyu Mao | Shumin Deng | Yunzhi Yao | Yue Shen | Lei Liang | Jinjie Gu | Huajun Chen | Ningyu Zhang
Findings of the Association for Computational Linguistics: EMNLP 2024
Recently, there has been a growing interest in knowledge editing for Large Language Models (LLMs). Current approaches and evaluations merely explore the instance-level editing, while whether LLMs possess the capability to modify concepts remains unclear. This paper pioneers the investigation of editing conceptual knowledge for LLMs, by constructing a novel benchmark dataset ConceptEdit and establishing a suite of new metrics for evaluation. The experimental results reveal that, although existing editing methods can efficiently modify concept-level definition to some extent, they also have the potential to distort the related instantial knowledge in LLMs, leading to poor performance. We anticipate this work can inspire further progress in understanding LLMs.
Unified Hallucination Detection for Multimodal Large Language Models
Xiang Chen | Chenxi Wang | Yida Xue | Ningyu Zhang | Xiaoyan Yang | Qiang Li | Yue Shen | Lei Liang | Jinjie Gu | Huajun Chen
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Despite significant strides in multimodal tasks, Multimodal Large Language Models (MLLMs) are plagued by the critical issue of hallucination. The reliable detection of such hallucinations in MLLMs has, therefore, become a vital aspect of model evaluation and the safeguarding of practical application deployment. Prior research in this domain has been constrained by a narrow focus on singular tasks, an inadequate range of hallucination categories addressed, and a lack of detailed granularity. In response to these challenges, our work expands the investigative horizons of hallucination detection. We present a novel meta-evaluation benchmark, MHaluBench, meticulously crafted to facilitate the evaluation of advancements in hallucination detection methods. Additionally, we unveil a novel unified multimodal hallucination detection framework, UNIHD, which leverages a suite of auxiliary tools to validate the occurrence of hallucinations robustly. We demonstrate the effectiveness of UNIHD through meticulous evaluation and comprehensive analysis. We also provide strategic insights on the application of specific tools for addressing various categories of hallucinations.
CharPoet: A Chinese Classical Poetry Generation System Based on Token-free LLM
Chengyue Yu | Lei Zang | Jiaotuan Wang | Chenyi Zhuang | Jinjie Gu
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Automatic Chinese classical poetry generation has attracted much research interest, but achieving effective control over format and content simultaneously remains challenging. Traditional systems usually accept keywords as user inputs, resulting in limited control over content. Large language models (LLMs) improve content control by allowing unrestricted user instructions, but the token-by-token generation process frequently makes format errors. Motivated by this, we propose CharPoet, a Chinese classical poetry generation system based on token-free LLM, which provides effective control over both format and content. Our token-free architecture generates in a character-by-character manner, enabling precise control over the number of characters. Pruned from existing token-based LLMs, CharPoet inherits their pretrained capabilities and can generate poetry following instructions like “Write me a poem for my mother’s birthday.” CharPoet achieves format accuracy above 0.96, outperforming Jiuge-GPT-2 (0.91) and GPT-4 (0.38). In terms of content quality, CharPoet surpasses traditional systems including Jiuge, and is comparable to other LLMs. Our system is open source and available at https://modelscope.cn/models/CharPoet/CharPoet. A video demonstration of CharPoet is available at https://youtu.be/voZ25qEp3Dc.
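The format accuracy figures quoted in this abstract suggest a check on line and character counts. The sketch below shows one plausible way such a check could be computed. The function names, the punctuation set, and the exact matching rule are assumptions, and the paper's own definition of format accuracy may differ.

```python
def matches_format(poem, lines, chars_per_line):
    """Return True if `poem` has exactly `lines` non-empty lines, each
    with `chars_per_line` characters.

    Punctuation and whitespace are ignored when counting characters.
    """
    punctuation = set(",。!?、;:,.!?;:")
    rows = [row.strip() for row in poem.splitlines() if row.strip()]
    if len(rows) != lines:
        return False
    return all(
        sum(1 for ch in row if ch not in punctuation and not ch.isspace())
        == chars_per_line
        for row in rows
    )


def format_accuracy(poems, lines, chars_per_line):
    """Fraction of poems that satisfy the requested format."""
    if not poems:
        return 0.0
    hits = sum(matches_format(p, lines, chars_per_line) for p in poems)
    return hits / len(poems)
```

Counting characters rather than tokens is what makes this check natural for a character-by-character generator: a token-based model can emit a token that spans several characters and overshoot the line length.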
Co-authors
- Yue Shen 6
- Huajun Chen 4
- Lei Liang 3
- Ningyu Zhang 3
- Shumin Deng 2
- Jian Wang 2
- Peng Wei 2
- Dan Yang 2
- Mingyang Chen 1
- Xiang Chen 1
- Zeyuan Chen 1
- Zhixuan Chu 1
- Jie Feng 1
- Binbin Hu 1
- Xinyu Hu 1
- Baijun Ji 1
- Yixin Ji (纪一心) 1
- Xu Jia 1
- Yihan Jiao 1
- Qiang Li 1
- Li Lin 1
- Hong Liu 1
- Lei Liu 1
- Zhongyi Liu 1
- Ziqi Liu 1
- Meixiu Long 1
- Shiwei Lyu 1
- Shengyu Mao 1
- Linjian Mo 1
- Yixin Ou 1
- Jeff Z. Pan 1
- Shuofei Qiao 1
- Kui Ren 1
- Duolin Sun 1
- Zhehao Tan 1
- Xiaojun Wan 1
- Chenxi Wang 1
- Cunxiang Wang 1
- Jiaotuan Wang 1
- Junjie Wang 1
- Junjie Wang 1
- Qiang Wang 1
- Xiaohan Wang 1
- Kaixin Wu 1
- Zequn Xie 1
- Fan Xu (徐凡) 1
- Yida Xue 1
- Xiaoyan Yang 1
- Xiaoyan Yang 1
- Yunzhi Yao 1
- Chengyue Yu 1
- Zhenghan Yu 1
- Lei Zang 1
- Wen Zhang 1
- Xu Zhang 1
- Yang Zhang 1
- Zhiqiang Zhang 1
- Jun Zhou 1
- Wei Zhou 1
- Yuan Zhou 1
- Hao Zhu 1
- Yuqi Zhu 1
- Chenyi Zhuang 1