Yubao Tang


2024

Bootstrapped Pre-training with Dynamic Identifier Prediction for Generative Retrieval
Yubao Tang | Ruqing Zhang | Jiafeng Guo | Maarten de Rijke | Yixing Fan | Xueqi Cheng
Findings of the Association for Computational Linguistics ACL 2024

Generative retrieval uses differentiable search indexes to directly generate relevant document identifiers in response to a query. Recent studies have highlighted the potential of a strong generative retrieval model, trained with carefully crafted pre-training tasks, to enhance downstream retrieval tasks via fine-tuning. However, the full power of pre-training for generative retrieval remains underexploited due to its reliance on pre-defined static document identifiers, which may not align with evolving model parameters. In this work, we introduce BootRet, a bootstrapped pre-training method for generative retrieval that dynamically adjusts document identifiers during pre-training to accommodate the continuing memorization of the corpus. BootRet involves three key training phases: (i) initial identifier generation, (ii) pre-training via corpus indexing and relevance prediction tasks, and (iii) bootstrapping for identifier updates. To facilitate the pre-training phase, we further introduce noisy documents and pseudo-queries, generated by large language models, to resemble semantic connections in both indexing and retrieval tasks. Experimental results demonstrate that BootRet significantly outperforms existing pre-training generative retrieval baselines and performs well even in zero-shot settings.

2023

生成式信息检索前沿进展与挑战(Challenges and Advances in Generative Information Retrieval)
Yixing Fan (意兴 范) | Yubao Tang (钰葆 唐) | Jiangui Chen (建贵 陈) | Ruqing Zhang (儒清 张) | Jiafeng Guo (嘉丰 郭)
Proceedings of the 22nd Chinese National Conference on Computational Linguistics (Volume 2: Frontier Forum)

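Information retrieval (IR) aims to find information relevant to a user's query in a large-scale corpus, and it has become one of the most important tools for solving problems in everyday work and life. Existing IR systems mainly rely on an "index-retrieve-rerank" framework, which models the complex retrieval task as a search process made up of multiple coupled stages. Modeling the stages separately improves retrieval efficiency, allowing retrieval systems to handle corpora at the scale of billions with ease, but it also increases the complexity of the system architecture and rules out end-to-end joint optimization. To address this problem, researchers have in recent years begun to explore modeling the entire search process with a single unified model, and have proposed a new paradigm, generative information retrieval. This paradigm encodes the entire corpus into the retrieval model, which enables end-to-end optimization and removes the retrieval system's dependence on an external index. Generative retrieval is now one of the most active research directions in the IR field, and researchers have proposed a variety of approaches to improve retrieval effectiveness. Given the rapid progress in this direction, this paper presents a systematic survey of generative information retrieval, covering basic concepts, document identifiers, and model capacity. We also discuss several unresolved challenges and promising research directions, in the hope of inspiring and promoting further research on these topics.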