Ge Yu
Also published as: 戈 于
2026
MetaMem: Evolving Meta-Memory for Knowledge Utilization through Self-Reflective Symbolic Optimization
Haidong Xin | Xinze Li | Zhenghao Liu | Yukun Yan | Shuo Wang | Cheng Yang | Yu Gu | Ge Yu | Maosong Sun
Findings of the Association for Computational Linguistics: ACL 2026
Existing memory systems enable Large Language Models (LLMs) to support long-horizon human-LLM interactions by persisting historical interactions beyond limited context windows. However, while recent approaches have succeeded in constructing effective memories, they often disrupt the inherent logical and temporal relationships within interaction sessions, resulting in fragmented memory units and degraded reasoning performance. In this paper, we propose MetaMem, a novel framework that augments memory systems with a self-evolving meta-memory, aiming to teach LLMs how to effectively utilize memorized knowledge. During meta-memory optimization, MetaMem iteratively distills transferable knowledge utilization experiences across different tasks by self-reflecting on reasoning processes and performing actions to update the current meta-memory state. The accumulated meta-memory units serve as explicit knowledge utilization experiences, guiding the LLM to systematically identify and integrate critical evidence from scattered memory fragments. Extensive experiments demonstrate the effectiveness of MetaMem, which significantly outperforms strong baselines by over 3.6%. All codes and datasets are available at https://github.com/OpenBMB/MetaMem.
ThinkNote: Enhancing Knowledge Integration and Utilization of Large Language Models via Constructivist Cognition Modeling
Zhipeng Xu | Zhenghao Liu | Yukun Yan | Shuo Wang | Shi Yu | Zheni Zeng | Chaojun Xiao | Zhiyuan Liu | Ge Yu | Chenyan Xiong
Findings of the Association for Computational Linguistics: EACL 2026
Large Language Models (LLMs) have demonstrated strong performance across a wide range of NLP tasks. However, they often exhibit suboptimal behaviors and inconsistencies when exposed to unfamiliar external information, underscoring their limitations in effectively leveraging such knowledge. Inspired by constructivist learning theory, we propose ThinkNote, a novel framework that enhances the external knowledge utilization of LLMs through a two-stage constructivist cognitive modeling process. Specifically, ThinkNote performs knowledge assimilation to align new information with the model’s parametric memory, forming a coherent internal representation. It then applies thought accommodation to adapt internal reasoning, thereby promoting more consistent and reliable outputs. Extensive experimental results demonstrate that ThinkNote achieves a 10% improvement over strong baseline methods on various question-answering benchmarks. Further analysis indicates that ThinkNote effectively integrates and utilizes external knowledge to help LLMs generate accurate responses and improves their self-consistency. All data and code will be publicly available at https://github.com/OpenMatch/ThinkNote.
Towards Efficient and Effective Diffusion Language Model Inference via Semantic-Aware Adaptive Denoising
Fan Li | Yu Gu | Zhigang Wang | Fangling Leng | Zhenghao Liu | Ge Yu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Diffusion language models (DLMs) have emerged as a powerful non-autoregressive alternative to GPT-style sequential generation, but suffer from substantial computational overhead due to their iterative parallel denoising. Existing acceleration methods cannot accurately detect semantically stabilized tokens and skip their computation, leading to sub-optimal speedup in practice. This paper presents the first systematic study of convergence dynamics in DLMs. Key observations include the misalignment between the traditionally used scalar detection criterion and semantic convergence, and the post-peak confidence score, which wastes denoising computation and degrades inference quality. To address these limitations, we propose Ada-DLM, a semantic-aware adaptive denoising framework that encodes the trajectory of scalar confidence scores into an evolution-aware feature vector and then clusters these vectors to proactively and adaptively identify semantically converged tokens. Furthermore, we incorporate system-level optimizations to maximize runtime efficiency. Experiments show that Ada-DLM consistently outperforms the SOTA competitor, achieving up to 2x speedup and 19% quality improvement. This offers a practical path toward efficient high-quality DLM deployment.
UNIKIE-BENCH: Benchmarking Large Multimodal Models for Key Information Extraction in Visual Documents
Yifan Ji | Zhipeng Xu | Zhenghao Liu | Zulong Chen | Qian Zhang | Zhibo Yang | Junyang Lin | Yu Gu | Ge Yu | Maosong Sun
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Key Information Extraction (KIE) from real-world documents remains challenging due to substantial variations in layout structures, visual quality, and task-specific information requirements. Recent Large Multimodal Models (LMMs) have shown promising potential for performing end-to-end KIE directly from document images. To enable a comprehensive and systematic evaluation across realistic and diverse application scenarios, we introduce UNIKIE-BENCH, a unified benchmark designed to rigorously evaluate the KIE capabilities of LMMs. UNIKIE-BENCH consists of two complementary tracks: a constrained-category KIE track with scenario-predefined schemas that reflect practical application needs, and an open-category KIE track that extracts any key information that is explicitly present in the document. Experiments on 15 state-of-the-art LMMs reveal substantial performance degradation under diverse schema definitions, long-tail key fields, and complex layouts, along with pronounced performance disparities across different document types and scenarios. These findings underscore persistent challenges in grounding accuracy and layout-aware reasoning for LMM-based KIE. All codes and datasets are available at https://github.com/NEUIR/UNIKIE-BENCH.
Enhancing Long-Chain Reasoning Distillation through Error-Aware Self-Reflection
Zhuoyang Wu | Xinze Li | Zhenghao Liu | Yukun Yan | Zhiyuan Liu | Minghe Yu | Cheng Yang | Yu Gu | Ge Yu | Maosong Sun
Findings of the Association for Computational Linguistics: ACL 2026
Large Language Models (LLMs) have exhibited strong reasoning capabilities and achieved remarkable performance in mathematical problem-solving tasks. Recently, distilling reasoning ability from long-form Chains-of-Thought (CoTs) has emerged as a promising approach for enhancing Small Language Models (SLMs). Existing studies typically treat SLMs as student models and use long-form CoTs as supervision signals for Supervised Fine-Tuning (SFT) to transfer reasoning ability. However, such long-form CoT teachers are usually unaware of the student model’s capacity, which limits the effective utilization of the provided reasoning traces. To overcome this limitation, we propose error-aware self-reflection (ORION), a framework that refines teacher CoTs through an Error-Aware Reflection process. ORION enables the student model to construct more tailored teacher CoTs by refining teacher CoTs and incorporating its own reasoning errors. Experiments on multiple mathematical reasoning benchmarks demonstrate that ORION consistently improves performance by more than 2% over all baselines. Further analysis reveals that the CoTs constructed by ORION exhibit higher coherence and logical consistency, thereby serving as more effective supervision signals for SFT. All codes are available at https://github.com/NEUIR/ORION.
Revealing the Attention Floating Mechanism in Masked Diffusion Models
Xin Dai | Pengcheng Huang | Zhenghao Liu | Shuo Wang | Yukun Yan | Chaojun Xiao | Yu Gu | Ge Yu | Maosong Sun
Findings of the Association for Computational Linguistics: ACL 2026
Masked diffusion models (MDMs), which leverage bidirectional attention and a denoising process, are narrowing the performance gap with autoregressive models (ARMs). However, their internal attention mechanisms remain under-explored. This paper investigates the attention behaviors in MDMs, revealing the phenomenon of Attention Floating. Unlike ARMs, where attention converges to a fixed sink, MDMs exhibit dynamic, dispersed attention anchors that shift across denoising steps and layers. Further analysis reveals its Shallow Structure-Aware, Deep Content-Focused attention mechanism: shallow layers utilize floating tokens to build a global structural framework, while deeper layers allocate more capability toward capturing semantic content. Empirically, this distinctive attention pattern provides a mechanistic explanation for the strong in-context learning capabilities of MDMs, allowing them to double the performance compared to ARMs in knowledge-intensive tasks. All codes and datasets will be available via GitHub.
HqeKV: Towards Hybrid Quantization and Eviction for KV Cache in Long-Context LLM Inference
He Wang | Yu Gu | Fangfang Li | Zhigang Wang | Zhenghao Liu | Ning Wang | Xiaohua Li | Ge Yu
Findings of the Association for Computational Linguistics: ACL 2026
The autoregressive inference in large language models requires repeated computation across transformer layers. While caching intermediate key-value (KV) pairs eliminates redundancy, it introduces severe memory overhead, particularly in long-context settings. Most existing cache compression methods operate solely on either quantization or eviction, based on importance estimation of cached data. However, they are limited by coarse compression choices and inaccurate importance assessment, leading to suboptimal inference quality. To address this, we propose HqeKV, a hybrid compression framework built on both quantization and eviction, offering finer-grained compression options that adapt smoothly to the varying importance of cached KV pairs. An integrated optimizer automatically selects the best compression action for each cached element, maximizing quality while insulating end-users from tedious low-level tuning details. We further design a joint K–V importance metric to provide more accurate importance assessment results so that the optimizer can make smarter decisions. Additionally, HqeKV supports flexible conversion policies across multiple quantization precision levels, to further reduce quality degradation. Extensive experiments show that HqeKV improves output quality under the same memory constraints, outperforming state-of-the-art alternatives. Code is available at https://github.com/skywclouds/HqeKV.
Chunks as Arms: Multi-Armed Bandit-Guided Sampling for Long-Context LLM Preference Optimization
Shaohua Duan | Pengcheng Huang | Xinze Li | Zhenghao Liu | Xiaoyuan Yi | Yukun Yan | Shuo Wang | Yu Gu | Ge Yu | Maosong Sun
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Long-context modeling is critical for a wide range of real-world tasks, including long-context question answering, summarization, and complex reasoning tasks. Recent studies have explored fine-tuning Large Language Models (LLMs) with synthetic data to enhance their long-context capabilities. However, the effectiveness of such approaches is often limited by the low diversity and factual inconsistencies in the generated data. To address these challenges, we propose LongMab, a novel framework that leverages a Multi-Armed Bandit (MAB) rollout strategy to identify the most informative chunks from the given long context for sampling high-quality and diverse responses and constructing preference data pairs for Direct Preference Optimization (DPO) training. Specifically, we treat context chunks as arms of MAB, select chunks based on their expected reward scores to input into LLMs to generate responses, and iteratively update these scores based on reward feedback. Both exploration and exploitation during the rollout process enable the LLM to focus on the most relevant context segments, thereby generating and collecting high-quality and diverse responses. Experimental results on both Llama and Qwen show the effectiveness of LongMab by achieving more than a 4% improvement on long-context reasoning benchmarks. All data and code will be released on https://github.com/NEUIR/LongMab-PO.
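The chunk-selection loop described in the abstract above can be illustrated with a minimal bandit sketch. Several things here are assumptions made for illustration and do not come from the paper: the UCB1 scoring rule, selecting a single chunk per round, and the names `ucb_select`, `run_bandit`, and `toy_reward`. The reward function is a hypothetical stand-in for generating a response from a chunk and scoring it.

```python
import math
import random


def ucb_select(counts, values, t, c=1.0):
    """Return the index of the chunk (arm) with the highest UCB1 score."""
    def score(i):
        if counts[i] == 0:
            return float("inf")  # try every chunk at least once
        return values[i] + c * math.sqrt(math.log(t) / counts[i])
    return max(range(len(counts)), key=score)


def run_bandit(num_chunks, reward_fn, rounds=200, seed=0):
    """Treat chunks as arms: select one, observe a reward, update its mean."""
    rng = random.Random(seed)
    counts = [0] * num_chunks      # how often each chunk was selected
    values = [0.0] * num_chunks    # running mean reward per chunk
    for t in range(1, rounds + 1):
        i = ucb_select(counts, values, t)
        reward = reward_fn(i, rng)
        counts[i] += 1
        values[i] += (reward - values[i]) / counts[i]  # incremental mean
    return counts, values


def toy_reward(chunk_index, rng):
    # Hypothetical reward: only chunk 3 is informative for the question.
    base = 1.0 if chunk_index == 3 else 0.0
    return base + rng.uniform(-0.1, 0.1)


counts, values = run_bandit(num_chunks=6, reward_fn=toy_reward)
best_chunk = max(range(6), key=lambda i: values[i])
```

In this toy setting the exploration bonus makes the loop revisit weak chunks occasionally, while most selections concentrate on the chunk with the highest estimated reward.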
Long-Chain Reasoning Distillation via Adaptive Prefix Alignment
Zhenghao Liu | Zhuoyang Wu | Xinze Li | Yukun Yan | Shuo Wang | Zulong Chen | Yu Gu | Ge Yu | Maosong Sun
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities, particularly in solving complex mathematical problems. Recent studies show that distilling long reasoning trajectories can effectively enhance the reasoning performance of small-scale student models. However, teacher-generated reasoning trajectories are often excessively long and structurally complex, making them difficult for student models to learn. This mismatch leads to a gap between the provided supervision signal and the learning capacity of the student model. To address this challenge, we propose Prefix-ALIGNment distillation (P-ALIGN), a framework that fully exploits teacher CoTs for distillation through adaptive prefix alignment. Specifically, P-ALIGN adaptively truncates teacher-generated reasoning trajectories by determining whether the remaining suffix is concise and sufficient to guide the student model. Then, P-ALIGN leverages the teacher-generated prefix to supervise the student model, encouraging effective prefix alignment. Experiments on multiple mathematical reasoning benchmarks demonstrate that P-ALIGN outperforms all baselines by over 3%. Further analysis indicates that the prefixes constructed by P-ALIGN provide more effective supervision signals, while avoiding the negative impact of redundant and uncertain reasoning components. All codes are available at https://github.com/NEUIR/P-ALIGN.
Resource-Limited Joint Multimodal Sentiment Reasoning and Classification via Chain-of-Thought Enhancement and Distillation
Haonan Shangguan | Xiaocui Yang | Shi Feng | Daling Wang | Yifei Zhang | Feiliang Ren | Ge Yu
Findings of the Association for Computational Linguistics: ACL 2026
Current approaches for Multimodal Sentiment Analysis (MSA) primarily leverage the knowledge and reasoning capabilities of parameter-heavy (Multimodal) LLMs for classification, overlooking autonomous multimodal sentiment reasoning generation in resource-constrained environments. In this paper, we focus on the Resource-Limited Joint Multimodal Sentiment Reasoning and Classification task, JMSRC, which simultaneously performs multimodal sentiment reasoning chain generation and sentiment classification with only a lightweight model. We propose a Multimodal Chain-of-Thought Reasoning Distillation model, MulCoT-RD, designed for JMSRC, which employs a "Teacher-Assistant-Student" distillation paradigm to address deployment constraints in resource-limited environments. We first leverage a high-performance Multimodal Large Language Model (MLLM) to generate the initial reasoning dataset and train a medium-sized assistant model with a multi-task learning mechanism. A lightweight student model is jointly trained to perform efficient multimodal sentiment reasoning generation and classification. Extensive experiments on four datasets demonstrate that MulCoT-RD, with only 3B parameters, achieves strong performance on JMSRC, while exhibiting robust generalization and enhanced interpretability.
Lang2Act: Fine-Grained Visual Reasoning through Self-Emergent Linguistic Toolchains
Yuqi Xiong | Chunyi Peng | Zhipeng Xu | Zhenghao Liu | Zulong Chen | Yukun Yan | Shuo Wang | Yu Gu | Ge Yu
Findings of the Association for Computational Linguistics: ACL 2026
Visual Retrieval-Augmented Generation (VRAG) enhances Vision-Language Models (VLMs) by incorporating external visual documents to address a given query. Existing VRAG frameworks usually depend on rigid, pre-defined external tools to extend the perceptual capabilities of VLMs, typically by explicitly separating visual perception from subsequent reasoning processes. However, this decoupled design can lead to unnecessary loss of visual information, particularly when image-based operations such as cropping are applied. In this paper, we propose Lang2Act, which enables fine-grained visual perception and reasoning through self-emergent linguistic toolchains. Rather than invoking fixed external engines, Lang2Act collects self-emergent actions as linguistic tools and leverages them to enhance the visual perception capabilities of VLMs. To support this mechanism, we design a two-stage Reinforcement Learning (RL)-based training framework. Specifically, the first stage optimizes VLMs to self-explore high-quality actions for constructing a reusable linguistic toolbox, and the second stage further optimizes VLMs to exploit these linguistic tools for downstream reasoning effectively. Experimental results demonstrate the effectiveness of Lang2Act in substantially enhancing the visual perception capabilities of VLMs, achieving performance improvements of over 4%. All code and data are available at https://github.com/NEUIR/Lang2Act.
Mitigating Judgment Preference Bias in Large Language Models through Group-Based Polling
Shuliang Liu | Zhipeng Xu | Zhenghao Liu | Yukun Yan | Minghe Yu | Yu Gu | Chong Chen | Huiyuan Xie | Ge Yu
Findings of the Association for Computational Linguistics: ACL 2026
Large Language Models (LLMs) used as automatic evaluators, commonly referred to as LLM-as-a-Judge, have attracted growing attention. This approach plays a vital role in aligning LLMs with human judgments, providing accurate and reliable assessments. However, LLM-based judgment models often exhibit judgment preference bias during the evaluation phase, tending to favor responses generated by themselves, which undermines the reliability of their judgments. This paper introduces Group-Based Polling Optimization (Genii), an unsupervised multi-agent collaborative optimization framework that mitigates the inherent judgment preference bias of judgment models. Specifically, Genii integrates various LLM-based judgment models into a multi-agent system and simulates the interactive client-server polling mechanism to optimize each client agent in an unsupervised manner. Our experiments demonstrate that Genii outperforms supervised models trained on annotated judgment data, while requiring no human-labeled annotations. Genii consistently improves performance across different client agents during the polling, even when weaker models act as server agents. Further analysis reveals that Genii effectively mitigates judgment preference bias of LLM-based judgment models, demonstrating its effectiveness. All codes are available at https://github.com/NEUIR/Genii.
2025
RankCoT: Refining Knowledge for Retrieval-Augmented Generation through Ranking Chain-of-Thoughts
Mingyan Wu | Zhenghao Liu | Yukun Yan | Xinze Li | Shi Yu | Zheni Zeng | Yu Gu | Ge Yu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Retrieval-Augmented Generation (RAG) enhances the performance of Large Language Models (LLMs) by incorporating external knowledge. However, LLMs still encounter challenges in effectively utilizing the knowledge from retrieved documents, often being misled by irrelevant or noisy information. To address this issue, we introduce RankCoT, a knowledge refinement method that incorporates reranking signals when generating CoT-based summaries for knowledge refinement, based on the given query and all retrieved documents. During training, RankCoT prompts the LLM to generate Chain-of-Thought (CoT) candidates based on the query and individual documents. It then fine-tunes the LLM to directly reproduce the best CoT from these candidate outputs based on all retrieved documents, which requires the LLM to filter out irrelevant documents while generating CoT-style summaries. Additionally, RankCoT incorporates a self-reflection mechanism that further refines the CoT outputs, resulting in higher-quality training data. Our experiments demonstrate the effectiveness of RankCoT, showing its superior performance over other knowledge refinement models. Further analysis reveals that RankCoT can provide shorter but effective refinement results, enabling the generator to produce more accurate answers. All code and data are available at https://github.com/NEUIR/RankCoT.
MeMoTune: A Measure and Moment-Driven Fine-Tuning Framework for Quantized Large Language Models
Yun Zhang | Xue Geng | Lizi Liao | Jintong Sun | Minghe Yu | Ge Yu
Findings of the Association for Computational Linguistics: ACL 2025
Quantizing large language models (LLMs) is essential for reducing memory and computational costs in natural language processing. Existing methods combine quantization with parameter-efficient fine-tuning but often fail to meet practical performance requirements. This paper introduces MeMoTune, a novel fine-tuning framework for quantized LLMs. By employing a measure and moment approach within a low-rank approximation framework in probability measure space, MeMoTune optimizes the objective function for superior fine-tuning results. The update process is further refined through scaled gradient, enhancing convergence efficiency and noise robustness. Experiments on tasks like text generation, summarization, and understanding show MeMoTune significantly outperforms state-of-the-art methods, e.g. fine-tuning Llama2-13B on GSM8K improves accuracy by 5.5%, while fine-tuning DeBERTaV3-base on CoLA of GLUE increases Matthews correlation by 1.7%. The code is publicly available at: https://github.com/hddyyyb/MeMoTune.
ReCUT: Balancing Reasoning Length and Accuracy in LLMs via Stepwise Trails and Preference Optimization
Zhensheng Jin | Xinze Li | Yifan Ji | Chunyi Peng | Zhenghao Liu | Qi Shi | Yukun Yan | Shuo Wang | Furong Peng | Ge Yu
Findings of the Association for Computational Linguistics: EMNLP 2025
Recent advances in Chain-of-Thought (CoT) prompting have substantially improved the reasoning capabilities of Large Language Models (LLMs). However, these methods often suffer from overthinking, leading to unnecessarily lengthy or redundant reasoning traces. Existing approaches attempt to mitigate this issue by curating multiple reasoning chains for training LLMs, but their effectiveness is often constrained by the quality of the generated data, and they are prone to overfitting. To address this challenge, we propose Reasoning Compression Through Stepwise Trials (ReCUT), a novel method aimed at balancing the accuracy and length of reasoning trajectories. Specifically, ReCUT employs a stepwise exploration mechanism and a long-short switched sampling strategy, enabling LLMs to incrementally generate diverse reasoning paths. These paths are evaluated and used to construct preference pairs to train two specialized models (Gemini LLMs), one optimized for reasoning accuracy and the other for shorter reasoning. A final integrated model is obtained by interpolating the parameters of these two models. Experimental results across multiple math reasoning datasets and backbone models demonstrate that ReCUT significantly reduces reasoning lengths by approximately 30-50%, while maintaining or improving reasoning accuracy compared to various baselines. All codes and data will be released via https://github.com/NEUIR/ReCUT.
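The final merging step in the abstract above, interpolating the parameters of two models, can be sketched as follows. The abstract does not specify the interpolation scheme, so the element-wise linear form, the mixing weight `alpha`, and the name `interpolate_params` are assumptions made for illustration. Parameters are modeled as plain dictionaries of float lists rather than real model weights.

```python
def interpolate_params(params_a, params_b, alpha=0.5):
    """Return alpha * params_a + (1 - alpha) * params_b, parameter by parameter."""
    if params_a.keys() != params_b.keys():
        raise ValueError("models must share the same parameter names")
    merged = {}
    for name in params_a:
        a, b = params_a[name], params_b[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [alpha * x + (1 - alpha) * y for x, y in zip(a, b)]
    return merged


# Hypothetical toy "models": one tuned for accuracy, one for shorter reasoning.
accuracy_model = {"w": [1.0, 2.0], "b": [0.0]}
length_model = {"w": [3.0, 0.0], "b": [1.0]}

merged = interpolate_params(accuracy_model, length_model, alpha=0.25)
```

Under this assumed scheme, `alpha` close to 1 keeps the merged model near the accuracy-oriented weights, and `alpha` close to 0 keeps it near the length-oriented weights.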
Judge as A Judge: Improving the Evaluation of Retrieval-Augmented Generation through the Judge-Consistency of Large Language Models
Shuliang Liu | Xinze Li | Zhenghao Liu | Yukun Yan | Cheng Yang | Zheni Zeng | Zhiyuan Liu | Maosong Sun | Ge Yu
Findings of the Association for Computational Linguistics: ACL 2025
Retrieval-Augmented Generation (RAG) has proven its effectiveness in alleviating hallucinations for Large Language Models (LLMs). However, existing automated evaluation metrics cannot fairly evaluate the outputs generated by RAG models during training and evaluation. LLM-based judgment models provide the potential to produce high-quality judgments, but they are highly sensitive to evaluation prompts, leading to inconsistencies when judging the output of RAG models. This paper introduces the Judge-Consistency (ConsJudge) method, which aims to enhance LLMs to generate more accurate evaluations for RAG models. Specifically, ConsJudge prompts LLMs to generate different judgments based on various combinations of judgment dimensions, utilizes the judge-consistency to evaluate these judgments, and selects the chosen and rejected judgments for DPO training. Our experiments show that ConsJudge can effectively provide more accurate judgments for optimizing RAG models across various RAG models and datasets. Further analysis reveals that judgments generated by ConsJudge have a high agreement with the superior LLM. All codes are available at https://github.com/OpenBMB/ConsJudge.
ExpandR: Teaching Dense Retrievers Beyond Queries with LLM Guidance
Sijia Yao | Pengcheng Huang | Zhenghao Liu | Yu Gu | Yukun Yan | Shi Yu | Ge Yu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large language models (LLMs) have demonstrated significant potential in enhancing dense retrieval through query augmentation. However, most existing methods treat the LLM and the retriever as separate modules, overlooking the alignment between generation and ranking objectives. In this work, we propose ExpandR, a unified LLM-augmented dense retrieval framework that jointly optimizes both the LLM and the retriever. ExpandR employs the LLM to generate semantically rich query expansions, which are leveraged to enhance the retriever’s training. Simultaneously, the LLM is trained using Direct Preference Optimization (DPO), guided by a carefully designed reward function that balances retrieval effectiveness and generation consistency. This joint optimization paradigm enables mutual adaptation between the LLM and the retriever, resulting in query expansions that are both informative and well-suited for retrieval. Experimental results on multiple benchmarks show that ExpandR consistently outperforms strong baselines, achieving more than a 5% improvement in retrieval performance. All codes are available at https://github.com/NEUIR/ExpandR.
COAST: Enhancing the Code Debugging Ability of LLMs through Communicative Agent Based Data Synthesis
Weiqing Yang | Hanbin Wang | Zhenghao Liu | Xinze Li | Yukun Yan | Shuo Wang | Yu Gu | Minghe Yu | Zhiyuan Liu | Ge Yu
Findings of the Association for Computational Linguistics: NAACL 2025
Code debugging is a vital stage of software development, essential for ensuring the reliability and performance of Large Language Models (LLMs) in the code generation task. Human debugging typically follows a multi-stage process, which includes Bug Localization, Bug Identification, Code Repair, and Code Recognition. However, existing code debugging benchmarks predominantly focus on the Code Repair stage, which offers only a limited perspective on evaluating the debugging capabilities of LLMs. In this paper, we introduce DEBUGEVAL, a comprehensive benchmark for evaluating the debugging abilities of LLMs by emulating the multi-stage human debugging process. Through evaluating on DEBUGEVAL, we observe that 7B-scale models consistently underperform compared to their larger counterparts, highlighting their limitations in comprehending code semantics. In this case, we propose the COmmunicative Agent-based data SynThesis (COAST) framework, which employs a multi-agent system to generate high-quality training data for supervised fine-tuning (SFT). Experimental results demonstrate that COAST-generated data outperform human-curated and GPT-4-generated data, enabling 7B-scale LLMs to achieve debugging performance comparable to GPT-3.5. All data and codes are available at https://github.com/NEUIR/COAST.
2024
MARVEL: Unlocking the Multi-Modal Capability of Dense Retrieval via Visual Module Plugin
Tianshuo Zhou | Sen Mei | Xinze Li | Zhenghao Liu | Chenyan Xiong | Zhiyuan Liu | Yu Gu | Ge Yu
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
This paper proposes Multi-modAl Retrieval model via Visual modulE pLugin (MARVEL), which learns an embedding space for queries and multi-modal documents to conduct retrieval. MARVEL encodes queries and multi-modal documents with a unified encoder model, which helps to alleviate the modality gap between images and texts. Specifically, we enable the image understanding ability of the well-trained dense retriever, T5-ANCE, by incorporating the visual module’s encoded image features as its inputs. To facilitate the multi-modal retrieval tasks, we build the ClueWeb22-MM dataset based on the ClueWeb22 dataset, which regards anchor texts as queries, and extracts the related text and image documents from anchor-linked web pages. Our experiments show that MARVEL significantly outperforms the state-of-the-art methods on the multi-modal retrieval datasets WebQA and ClueWeb22-MM. MARVEL provides an opportunity to broaden the advantages of text retrieval to the multi-modal scenario. We also show that the language model has the ability to extract image semantics and partly map the image features to the input word embedding space. All code is available at https://github.com/OpenMatch/MARVEL.
Cleaner Pretraining Corpus Curation with Neural Web Scraping
Zhipeng Xu | Zhenghao Liu | Yukun Yan | Zhiyuan Liu | Ge Yu | Chenyan Xiong
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
The web contains large-scale, diverse, and abundant information to satisfy the information-seeking needs of humans. Through meticulous data collection, preprocessing, and curation, webpages can be used as a fundamental data resource for language model pretraining. However, when confronted with the progressively revolutionized and intricate nature of webpages, rule-based/feature-based web scrapers are becoming increasingly inadequate. This paper presents a simple, fast, and effective Neural web Scraper (NeuScraper) to help extract primary and clean text contents from webpages. Experimental results show that NeuScraper surpasses the baseline scrapers by achieving more than a 20% improvement, demonstrating its potential in extracting higher-quality data to facilitate language model pretraining. All of the code is available at https://github.com/OpenMatch/NeuScraper.
INTERVENOR: Prompting the Coding Ability of Large Language Models with the Interactive Chain of Repair
Hanbin Wang | Zhenghao Liu | Shuo Wang | Ganqu Cui | Ning Ding | Zhiyuan Liu | Ge Yu
Findings of the Association for Computational Linguistics: ACL 2024
This paper introduces INTERVENOR (INTERactiVE chaiN Of Repair), a system designed to emulate the interactive code repair processes observed in humans, encompassing both code diagnosis and code repair. INTERVENOR prompts Large Language Models (LLMs) to play distinct roles during the code repair process, functioning as both a Code Learner and a Code Teacher. Specifically, the Code Learner is tasked with adhering to instructions to generate or repair code, while the Code Teacher is responsible for crafting a Chain-of-Repair (CoR) to serve as guidance for the Code Learner. While generating the CoR, the Code Teacher needs to check the code generated by the Code Learner and reassess how to address code bugs based on error feedback received from compilers. Experimental results demonstrate that INTERVENOR surpasses baseline models, exhibiting improvements of approximately 18% and 4.3% over GPT-3.5 in code generation and code translation tasks, respectively. Our further analyses show that CoR is effective at illuminating the reasons behind bugs and outlining solution plans in natural language. With the feedback of code compilers, INTERVENOR can accurately identify syntax errors and assertion errors and provide precise instructions to repair code. All data and code are available at https://github.com/NEUIR/INTERVENOR.
Self-Guide: Enhancing LLM Reasoning Ability via Self-Plan
Yibin Liu (刘艺彬) | Zhenghao Liu (刘正皓) | Yukun Yan (闫宇坤) | Shi Yu (于是) | Shuo Wang (王硕) | Liner Yang (杨麟儿) | Huimin Chen (陈慧敏) | Yu Gu (谷峪) | Ge Yu (于戈)
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference)
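Although large language models have made remarkable progress on natural language processing tasks, they still face a cognitive load problem in areas such as complex reasoning: the model must memorize and process a large amount of information during reasoning. How to effectively reduce this cognitive load and alleviate the cognitive overload that may arise during reasoning is therefore a pressing problem. To this end, this paper proposes the Self-Guide method to enhance the reasoning ability of language models. The method guides the large language model to generate commonsense knowledge and reasoning guidance, lets the model strengthen its reasoning through self-planning, and calibrates the reasoning process by combining the plan with reasoning chains. Unlike existing methods, this work significantly improves reasoning performance without fine-tuning the large language model or using external tools. Experimental results show that Self-Guide significantly outperforms baseline methods on four common reasoning tasks and that, compared with conventional reasoning-chain models, it also generalizes well to models with weaker reasoning ability. By combining the self-planning and reasoning capabilities of large language models, Self-Guide offers a new and effective way to improve the reasoning ability of language models.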
2023
Structure-Aware Language Model Pretraining Improves Dense Retrieval on Structured Data
Xinze Li | Zhenghao Liu | Chenyan Xiong | Shi Yu | Yu Gu | Zhiyuan Liu | Ge Yu
Findings of the Association for Computational Linguistics: ACL 2023
This paper presents the Structure Aware Dense Retrieval (SANTA) model, which encodes user queries and structured data in one universal embedding space for retrieving structured data. SANTA proposes two pretraining methods to make language models structure-aware and learn effective representations for structured data: 1) Structured Data Alignment, which utilizes the natural alignment relations between structured data and unstructured data for structure-aware pretraining. It contrastively trains language models to represent multi-modal text data and teaches models to distinguish matched structured data for unstructured texts. 2) Masked Entity Prediction, which designs an entity-oriented mask strategy and asks language models to fill in the masked entities. Our experiments show that SANTA achieves state-of-the-art results on code search and product search and achieves convincing results in the zero-shot setting. SANTA learns tailored representations for multi-modal text data by aligning structured and unstructured data pairs and capturing structural semantics by masking and predicting entities in the structured data. All code is available at https://github.com/OpenMatch/OpenMatch.
2013
Co-authors
- Zhenghao Liu (刘正皓) 21
- Yu Gu (谷峪) 16
- Yukun Yan (闫宇坤) 15
- Xinze Li 10
- Shuo Wang 9
- Maosong Sun (孙茂松) 7
- Zhiyuan Liu 5
- Zhipeng Xu 5
- Shi Yu (于是) 5
- Chenyan Xiong 4
- Minghe Yu 4
- Zulong Chen 3
- Pengcheng Huang 3
- Zhiyuan Liu 3
- Cheng Yang 3
- Zheni Zeng 3
- Shi Feng 2
- Yifan Ji 2
- Shuliang Liu 2
- Chunyi Peng 2
- Daling Wang 2
- Zhigang Wang 2
- Hanbin Wang 2
- Zhuoyang Wu 2
- Chaojun Xiao 2
- Huimin Chen 1
- Chong Chen 1
- Ganqu Cui 1
- Xin Dai 1
- Ning Ding 1
- Shaohua Duan 1
- Xue Geng 1
- Zhensheng Jin 1
- Fangling Leng 1
- Binyang Li 1
- Fan Li 1
- Fangfang Li 1
- Xiaohua Li 1
- Lizi Liao 1
- Junyang Lin 1
- Yibin Liu 1
- Sen Mei 1
- Furong Peng 1
- Feiliang Ren 1
- Haonan Shangguan 1
- Qi Shi 1
- Jintong Sun 1
- He Wang 1
- Ning Wang 1
- Shuo Wang 1
- Kam-Fai Wong 1
- Mingyan Wu 1
- Huiyuan Xie 1
- Haidong Xin 1
- Yuqi Xiong 1
- ZhiBo Yang 1
- Xiaocui Yang 1
- Liner Yang 1
- Weiqing Yang 1
- Sijia Yao 1
- Xiaoyuan Yi 1
- Le Zhang 1
- Qian Zhang 1
- Yun Zhang 1
- Yifei Zhang 1
- Tianshuo Zhou 1