2025
Boosting Long-Context Information Seeking via Query-Guided Activation Refilling
Hongjin Qian | Zheng Liu | Peitian Zhang | Zhicheng Dou | Defu Lian
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Processing long contexts poses a significant challenge for large language models (LLMs) due to their inherent context window limitations and the computational burden of extensive key-value (KV) activations, which severely impact efficiency. For information-seeking tasks, full context perception is often unnecessary, as a query’s information needs can dynamically range from localized details to a global perspective, depending on its complexity. However, existing methods struggle to adapt effectively to these dynamic information needs. In this paper, we propose a method for processing long-context information-seeking tasks via query-guided ACtivation REfilling (ACRE). ACRE constructs a Bi-layer KV Cache for long contexts, where the layer-1 (L1) cache compactly captures global information, and the layer-2 (L2) cache provides detailed, localized information. ACRE establishes a proxying relationship between the two caches, allowing the input query to attend to the L1 cache and dynamically refill it with relevant entries from the L2 cache. This mechanism integrates global understanding with query-specific local details, thereby enhancing answer decoding. Experiments on a variety of long-context information-seeking datasets demonstrate ACRE’s effectiveness, achieving significant improvements in both performance and efficiency.
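The refilling idea in the abstract above can be illustrated with a toy sketch. This is not the authors' implementation: the dot-product scoring, the `top_k` selection rule, the string stand-ins for cache entries, and the choice to keep each L1 entry next to its refilled L2 entries are all assumptions made for illustration.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product, used here as a stand-in for attention scoring."""
    return sum(x * y for x, y in zip(a, b))


def refill(query: Vector,
           l1_keys: List[Vector],
           l2_chunks: List[List[str]],
           top_k: int = 1) -> List[str]:
    """Build a working cache from a compact L1 cache and a detailed L2 cache.

    Assumption: l1_keys[i] acts as the proxy for l2_chunks[i]. The query
    scores every L1 key, and the top_k best-matching L1 entries are
    refilled with their detailed L2 entries.
    """
    scores = [dot(query, key) for key in l1_keys]
    ranked = sorted(range(len(l1_keys)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:top_k])

    cache: List[str] = []
    for i in range(len(l1_keys)):
        cache.append(f"L1[{i}]")          # global, compact entry is always kept
        if i in chosen:
            cache.extend(l2_chunks[i])    # local detail only where the query needs it
    return cache


# Hypothetical toy data: three L1 proxies, each covering two L2 entries.
l1_keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
l2_chunks = [["a0", "a1"], ["b0", "b1"], ["c0", "c1"]]
working_cache = refill([0.0, 2.0], l1_keys, l2_chunks, top_k=1)
```

With this toy query only the second region is expanded, so the working cache stays close to the size of the L1 cache while still exposing detail for the region the query matched.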
MegaPairs: Massive Data Synthesis for Universal Multimodal Retrieval
Junjie Zhou | Yongping Xiong | Zheng Liu | Ze Liu | Shitao Xiao | Yueze Wang | Bo Zhao | Chen Jason Zhang | Defu Lian
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Despite the rapidly growing demand for multimodal retrieval, progress in this field remains severely constrained by a lack of training data. In this paper, we introduce MegaPairs, a novel data synthesis method that leverages vision language models (VLMs) and open-domain images, together with a massive synthetic dataset generated from this method. Our empirical analysis shows that MegaPairs generates high-quality data, enabling the multimodal retriever to significantly outperform the baseline model trained on 70× more data from existing datasets. Moreover, since MegaPairs solely relies on general image corpora and open-source VLMs, it can be easily scaled up, enabling continuous improvements in retrieval performance. In this stage, we produced more than 26 million training instances and trained several models of varying sizes using this data. These new models achieve state-of-the-art zero-shot performance across 4 popular composed image retrieval (CIR) benchmarks and the highest overall performance on the 36 datasets provided by MMEB. They also demonstrate notable performance improvements with additional downstream fine-tuning. Our code, synthesized dataset, and pre-trained models are publicly available at https://github.com/VectorSpaceLab/MegaPairs.
Any Information Is Just Worth One Single Screenshot: Unifying Search With Visualized Information Retrieval
Zheng Liu | Ze Liu | Zhengyang Liang | Junjie Zhou | Shitao Xiao | Chao Gao | Chen Jason Zhang | Defu Lian
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
With the popularity of multimodal techniques, there is growing interest in acquiring useful information in visual forms. In this work, we formally define an emerging IR paradigm called Visualized Information Retrieval, or Vis-IR, where multimodal information, such as texts, images, tables and charts, is jointly represented by a unified visual format called Screenshots, for various retrieval applications. We further make three key contributions for Vis-IR. First, we create VIRA (Vis-IR Aggregation), a large-scale dataset comprising a vast collection of screenshots from diverse sources, carefully curated into captioned and question-answer formats. Second, we develop UniSE (Universal Screenshot Embeddings), a family of retrieval models that enable screenshots to query or be queried across arbitrary data modalities. Finally, we construct MVRB (Massive Visualized IR Benchmark), a comprehensive benchmark covering a variety of task forms and application scenarios. Through extensive evaluations on MVRB, we highlight the deficiencies of existing multimodal retrievers and the substantial improvements made by UniSE. Our data, model and benchmark have been made publicly available, which lays a solid foundation for this emerging field.
AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark
Jianlyu Chen | Nan Wang | Chaofan Li | Bo Wang | Shitao Xiao | Han Xiao | Hao Liao | Defu Lian | Zheng Liu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Evaluation plays a crucial role in the advancement of information retrieval (IR) models. However, current benchmarks, which are based on predefined domains and human-labeled data, face limitations in addressing evaluation needs for emerging domains both cost-effectively and efficiently. To address this challenge, we propose the Automated Heterogeneous Information Retrieval Benchmark (AIR-Bench). AIR-Bench is distinguished by three key features: 1) Automated. The testing data in AIR-Bench is automatically generated by large language models (LLMs) without human intervention. 2) Heterogeneous. The testing data in AIR-Bench is generated with respect to diverse tasks, domains and languages. 3) Dynamic. The domains and languages covered by AIR-Bench are constantly augmented to provide an increasingly comprehensive evaluation benchmark for community developers. We develop a reliable and robust data generation pipeline to automatically create diverse and high-quality evaluation datasets based on real-world corpora. Our findings demonstrate that the generated testing data in AIR-Bench aligns well with human-labeled testing data, making AIR-Bench a dependable benchmark for evaluating IR models. The resources in AIR-Bench are publicly available at https://github.com/AIR-Bench/AIR-Bench.
Reinforced IR: A Self-Boosting Framework For Domain-Adapted Information Retrieval
Chaofan Li | Jianlyu Chen | Yingxia Shao | Chaozhuo Li | Quanqing Xu | Defu Lian | Zheng Liu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
While retrieval techniques are widely used in practice, they still face significant challenges in cross-domain scenarios. Recently, generation-augmented methods have emerged as a promising solution to this problem. These methods enhance raw queries by incorporating additional information from an LLM-based generator, facilitating more direct retrieval of relevant documents. However, existing methods struggle with highly specialized situations that require extensive domain expertise. To address this problem, we present Reinforced-IR, a novel approach that jointly adapts a pre-trained retriever and generator for precise cross-domain retrieval. A key innovation of Reinforced-IR is its Self-Boosting framework, which enables retriever and generator to learn from each other’s feedback. Specifically, the generator is reinforced to generate query augmentations that enhance the retriever’s performance, while the retriever is trained to better discriminate the relevant documents identified by the generator. This iterative process allows the end-to-end retrieval performance to be progressively optimized using an unlabeled corpus from the target domain. In our experiment, Reinforced-IR outperforms existing domain adaptation methods by a large margin, leading to substantial improvements in retrieval quality across a wide range of application scenarios. We have publicly released our code at this repo.
Single LLM, Multiple Roles: A Unified Retrieval-Augmented Generation Framework Using Role-Specific Token Optimization
Yutao Zhu | Jiajie Jin | Hongjin Qian | Zheng Liu | Zhicheng Dou | Ji-Rong Wen
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Existing studies have optimized retrieval-augmented generation (RAG) across various sub-tasks, such as query understanding and retrieval refinement, but integrating these optimizations into a unified framework remains challenging. To tackle this problem, this work proposes RoleRAG, a unified RAG framework that achieves efficient multi-task processing through role-specific token optimization. RoleRAG comprises six modules, each handling a specific sub-task within the RAG process. Additionally, we introduce a query graph to represent the decomposition of the query, which can be dynamically resolved according to the decomposing state. All modules are driven by the same underlying LLM, distinguished by task-specific role tokens that are individually optimized. This design allows RoleRAG to dynamically activate different modules within a single LLM instance, thereby streamlining deployment and reducing resource consumption. Experimental results on five open-domain question-answering datasets demonstrate the effectiveness, generalizability, and flexibility of our framework.
SimpleDeepSearcher: Deep Information Seeking via Web-Powered Reasoning Trajectory Synthesis
Shuang Sun | Huatong Song | Yuhao Wang | Ruiyang Ren | Jinhao Jiang | Junjie Zhang | Fei Bai | Jia Deng | Wayne Xin Zhao | Zheng Liu | Lei Fang | Zhongyuan Wang | Ji-Rong Wen
Findings of the Association for Computational Linguistics: EMNLP 2025
Retrieval-augmented generation (RAG) systems have advanced large language models (LLMs) in complex deep search scenarios requiring multi-step reasoning and iterative information retrieval. However, existing approaches face critical limitations: they either lack high-quality training trajectories or suffer from distributional mismatches in simulated environments and prohibitive computational costs for real-world deployment. This paper introduces SimpleDeepSearcher, a lightweight yet effective framework that bridges this gap through strategic data engineering rather than complex training paradigms. Our approach synthesizes high-quality training data by simulating realistic user interactions in live web search environments, coupled with a multi-criteria curation strategy that optimizes the diversity and quality of both the input and output sides. Experiments on five benchmarks across diverse domains demonstrate that SFT on only 871 curated samples yields significant improvements over RL-based baselines. Our work establishes SFT as a viable pathway by systematically addressing the data-scarce bottleneck, offering practical insights for efficient deep search systems. Our anonymous code is available at https://github.com/RUCAIBox/SimpleDeepSearcher.
AutoMIR: Effective Zero-Shot Medical Information Retrieval without Relevance Labels
Lei Li | Xiangxu Zhang | Xiao Zhou | Zheng Liu
Findings of the Association for Computational Linguistics: EMNLP 2025
Medical information retrieval (MIR) is vital for accessing knowledge from electronic health records, scientific literature, and medical databases, supporting applications such as medical education, patient queries, and clinical diagnosis. However, effective zero-shot dense retrieval in the medical domain remains difficult due to the scarcity of relevance-labeled data. To address this challenge, we propose **S**elf-**L**earning **Hy**pothetical **D**ocument **E**mbeddings (**SL-HyDE**), a framework that leverages large language models (LLMs) to generate hypothetical documents conditioned on a query. These documents encapsulate essential medical context, guiding dense retrievers toward the most relevant results. SL-HyDE further employs a self-learning mechanism that iteratively improves pseudo-document generation and retrieval using unlabeled corpora, eliminating the need for labeled data. In addition, we introduce the Chinese Medical Information Retrieval Benchmark (CMIRB), a comprehensive evaluation suite reflecting real-world medical scenarios, comprising five tasks and ten datasets. By benchmarking ten models on CMIRB, we provide a rigorous standard for evaluating MIR systems. Experimental results demonstrate that SL-HyDE significantly outperforms HyDE in retrieval accuracy, while exhibiting strong generalization and scalability across diverse LLM and retriever configurations. Our code and data are publicly available at: https://github.com/ll0ruc/AutoMIR.
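The hypothetical-document step described above (the plain HyDE part, not the self-learning loop of SL-HyDE) can be sketched as follows. Everything here is a stand-in chosen for illustration: `toy_generator` replaces the LLM with a canned answer, and a bag-of-words cosine similarity replaces the dense retriever.

```python
import math
import re
from collections import Counter
from typing import Callable, List


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would use a dense encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    shared = set(a) & set(b)
    num = sum(a[t] * b[t] for t in shared)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0


def hyde_search(query: str,
                corpus: List[str],
                generate: Callable[[str], str],
                top_k: int = 1) -> List[str]:
    """Retrieve with the embedding of a generated pseudo-document instead of the raw query."""
    pseudo_doc = generate(query)
    q_vec = embed(pseudo_doc)
    ranked = sorted(corpus, key=lambda doc: cosine(q_vec, embed(doc)), reverse=True)
    return ranked[:top_k]


def toy_generator(query: str) -> str:
    """Hypothetical stand-in for an LLM call; returns a canned passage."""
    return "Type 2 diabetes is commonly treated with the drug metformin."


corpus = [
    "Metformin is a first-line drug for type 2 diabetes.",
    "Aspirin reduces fever and mild pain.",
]
query = "What treats high blood sugar?"
results = hyde_search(query, corpus, toy_generator)
```

In this toy setup the raw query shares no words with either document, while the generated passage overlaps heavily with the first one, which is the gap a hypothetical document is meant to bridge.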
2024
ChatRetriever: Adapting Large Language Models for Generalized and Robust Conversational Dense Retrieval
Kelong Mao | Chenlong Deng | Haonan Chen | Fengran Mo | Zheng Liu | Tetsuya Sakai | Zhicheng Dou
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Conversational search requires accurate interpretation of user intent from complex multi-turn contexts. This paper presents ChatRetriever, which inherits the strong generalization capability of large language models to robustly represent complex conversational sessions for dense retrieval. To achieve this, we propose a simple and effective dual-learning approach that adapts an LLM for retrieval via contrastive learning while enhancing complex session understanding through masked instruction tuning on high-quality conversational instruction tuning data. Extensive experiments on five conversational search benchmarks demonstrate that ChatRetriever significantly outperforms existing conversational dense retrievers, achieving state-of-the-art performance on par with LLM-based rewriting approaches. Furthermore, ChatRetriever exhibits superior robustness in handling diverse conversational contexts. Our work highlights the potential of adapting LLMs for retrieval with complex inputs like conversational search sessions and proposes an effective approach to advance this research direction.
Large Language Models as Foundations for Next-Gen Dense Retrieval: A Comprehensive Empirical Assessment
Kun Luo | Minghao Qin | Zheng Liu | Shitao Xiao | Jun Zhao | Kang Liu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Pre-trained language models like BERT and T5 serve as crucial backbone encoders for dense retrieval. However, these models often exhibit limited generalization capabilities and face challenges in improving in-domain accuracy. Recent research has explored using large language models (LLMs) as retrievers, achieving state-of-the-art performance across various tasks. Despite these advancements, the specific benefits of LLMs over traditional retrievers and the impact of different LLM configurations—such as parameter sizes, pre-training duration, and alignment processes—on retrieval tasks remain unclear. In this work, we conduct a comprehensive empirical study on a wide range of retrieval tasks, including in-domain accuracy, data efficiency, zero-shot generalization, lengthy retrieval, instruction-based retrieval, and multi-task learning. We evaluate over 15 different backbone LLMs and non-LLMs. Our findings reveal that larger models and extensive pre-training consistently enhance in-domain accuracy and data efficiency. Additionally, larger models demonstrate significant potential in zero-shot generalization, lengthy retrieval, instruction-based retrieval, and multi-task learning. These results underscore the advantages of LLMs as versatile and effective backbone encoders in dense retrieval, providing valuable insights for future research and development in this field.
M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
Jianlyu Chen | Shitao Xiao | Peitian Zhang | Kun Luo | Defu Lian | Zheng Liu
Findings of the Association for Computational Linguistics: ACL 2024
In this paper, we introduce a new embedding model called M3-Embedding, which is distinguished for its versatility in Multi-Linguality, Multi-Functionality, and Multi-Granularity. It provides uniform support for the semantic retrieval of more than 100 working languages. It can simultaneously accomplish the three common retrieval functionalities: dense retrieval, multi-vector retrieval, and sparse retrieval. Besides, it is also capable of processing inputs of different granularities, spanning from short sentences to long documents of up to 8,192 tokens. The effective training of M3-Embedding presents a series of technical contributions. Notably, we propose a novel self-knowledge distillation approach, where the relevance scores from different retrieval functionalities can be integrated as the teacher signal to enhance the training quality. We also optimize the batching strategy, which enables a large batch size and high training throughput to improve the discriminativeness of embeddings. M3-Embedding exhibits superior performance in our experiment, leading to new state-of-the-art results on multilingual, cross-lingual, and long-document retrieval benchmarks.
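A minimal sketch of the distillation idea mentioned above: relevance scores from the three retrieval functionalities are combined into one teacher signal, and each individual scorer is pulled toward it with a soft-label cross-entropy. The abstract does not give the exact formula, so the weighted sum, the default weights, and the softmax-based loss are assumptions.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over candidate scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def integrate_scores(dense: Sequence[float],
                     sparse: Sequence[float],
                     multi_vec: Sequence[float],
                     weights: Sequence[float] = (1.0, 1.0, 1.0)) -> List[float]:
    """Combine per-candidate scores from the three scorers (assumed: weighted sum)."""
    w_d, w_s, w_m = weights
    return [w_d * d + w_s * s + w_m * m for d, s, m in zip(dense, sparse, multi_vec)]


def distill_loss(student: Sequence[float], teacher: Sequence[float]) -> float:
    """Cross-entropy between the teacher's soft labels and the student's distribution."""
    p = softmax(teacher)
    q = softmax(student)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))


# Hypothetical scores for one query against three candidate passages.
dense = [2.0, 0.5, 0.1]
sparse = [1.0, 1.5, 0.0]
multi_vec = [1.5, 0.5, 0.2]

teacher = integrate_scores(dense, sparse, multi_vec)
loss_dense = distill_loss(dense, teacher)    # how far the dense scorer is from the teacher
loss_sparse = distill_loss(sparse, teacher)  # how far the sparse scorer is from the teacher
```

Because the teacher comes from the model's own scorers rather than from an external model, no separate teacher network is needed, which is the sense in which the distillation is "self" knowledge distillation.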
LM-Cocktail: Resilient Tuning of Language Models via Model Merging
Shitao Xiao | Zheng Liu | Peitian Zhang | Xingrun Xing
Findings of the Association for Computational Linguistics: ACL 2024
Pre-trained language models are continually fine-tuned to better support downstream applications. However, this operation may result in significant performance degeneration on general tasks beyond the targeted domain. To overcome this problem, we propose LM-Cocktail, which enables the fine-tuned model to stay resilient in general perspectives. Our method is conducted in the form of model merging, where the fine-tuned language model is merged with the pre-trained base model or the peer models from other domains through weighted average. Despite its simplicity, LM-Cocktail is surprisingly effective: the resulting model is able to achieve a strong empirical performance in the whole scope of general tasks while preserving a superior capacity in its targeted domain.
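Merging through a weighted average, as described above, can be sketched in a few lines. The parameter names, the list-of-floats representation, and the 0.5/0.5 weights are hypothetical; the abstract does not say how the merging weights are chosen.

```python
from typing import Dict, List, Sequence

Params = Dict[str, List[float]]


def merge_models(models: Sequence[Params], weights: Sequence[float]) -> Params:
    """Return the parameter-wise weighted average of models with identical shapes."""
    if not models or len(models) != len(weights):
        raise ValueError("need one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    norm = [w / total for w in weights]  # normalise so the weights sum to 1

    merged: Params = {}
    for name, reference in models[0].items():
        merged[name] = [
            sum(w * model[name][i] for w, model in zip(norm, models))
            for i in range(len(reference))
        ]
    return merged


# Hypothetical toy "models": a fine-tuned model and its pre-trained base.
base = {"layer.weight": [0.0, 2.0], "layer.bias": [1.0]}
tuned = {"layer.weight": [1.0, 4.0], "layer.bias": [3.0]}
merged = merge_models([tuned, base], [0.5, 0.5])
```

Peer models from other domains would simply be additional entries in `models`, each with its own weight.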
RAG-Studio: Towards In-Domain Adaptation of Retrieval Augmented Generation Through Self-Alignment
Kelong Mao | Zheng Liu | Hongjin Qian | Fengran Mo | Chenlong Deng | Zhicheng Dou
Findings of the Association for Computational Linguistics: EMNLP 2024
Retrieval-Augmented Generation (RAG) has proven to be an effective paradigm for enhancing the quality of text generation by integrating large language models (LLMs) with external knowledge. However, an off-the-shelf RAG system, which relies on generally pre-trained LLMs and retrievers, often falls short in specialized domains and applications. In this paper, we introduce RAG-Studio, an efficient self-aligned training framework to adapt general RAG models to specific domains solely through synthetic data, eliminating the need for expensive human-labeled in-domain data. RAG-Studio accepts a specialized domain corpus, a general LLM, and a general retriever, then autonomously generates contrastive training data for both the LLM and retriever through self-alignment. We fine-tune them to work cohesively as an integrated and effective domain-specific RAG system, where the LLM is adapted to incorporate new domain knowledge and become robust to noisy contexts, and the retriever learns to better align with the LLM’s preferences, providing more useful information and minimizing the risk of misleading the LLM. Extensive experiments across diverse in-domain question-answering datasets spanning the biomedical, finance, law, and computing domains show that RAG-Studio attains state-of-the-art performance, consistently outperforming the use of human-annotated data for fine-tuning.
Negating Negatives: Alignment with Human Negative Samples via Distributional Dispreference Optimization
Shitong Duan | Xiaoyuan Yi | Peng Zhang | Yan Liu | Zheng Liu | Tun Lu | Xing Xie | Ning Gu
Findings of the Association for Computational Linguistics: EMNLP 2024
Large language models (LLMs) have revolutionized the role of AI, yet pose potential social risks. To steer LLMs towards human preference, alignment technologies have been introduced and have gained increasing attention. Nevertheless, existing methods heavily rely on high-quality positive-negative training pairs, suffering from noisy positive responses that are barely distinguishable from negative ones. Given recent LLMs’ proficiency in generating helpful responses, this work pivots towards a new research question: **can we achieve alignment using solely human-annotated negative samples, preserving helpfulness while reducing harmfulness?** For this purpose, we propose Distributional Dispreference Optimization (D2O), which maximizes the discrepancy between dispreferred responses and the generated non-negative ones. In this way, D2O effectively eschews harmful information without incorporating noisy positive samples, while avoiding collapse using self-generated responses as anchors. We demonstrate that D2O can be regarded as learning a distributional preference model reflecting human dispreference against negative responses, which is theoretically an upper bound of the instance-level DPO. Extensive experiments demonstrate that our method achieves comparable generation quality and surpasses the latest strong baselines in producing less harmful and more informative responses with better training stability and faster convergence.