Hybrid Inverted Index Is a Robust Accelerator for Dense Retrieval

Peitian Zhang, Zheng Liu, Shitao Xiao, Zhicheng Dou, Jing Yao


Abstract
Inverted file structure is a common technique for accelerating dense retrieval. It clusters documents based on their embeddings; during searching, it probes nearby clusters w.r.t. an input query and only evaluates documents within them by subsequent codecs, thus avoiding the expensive cost of exhaustive traversal. However, the clustering is always lossy, which causes relevant documents to be missed in the probed clusters and hence degrades retrieval quality. In contrast, lexical matching, such as overlaps of salient terms, tends to be a strong feature for identifying relevant documents. In this work, we present the Hybrid Inverted Index (HI2), where the embedding clusters and salient terms work collaboratively to accelerate dense retrieval. To make the best of both effectiveness and efficiency, we devise a cluster selector and a term selector to construct compact inverted lists and efficiently search through them. Moreover, we leverage simple unsupervised algorithms as well as end-to-end knowledge distillation to learn these two modules, with the latter further boosting the effectiveness. Based on comprehensive experiments on popular retrieval benchmarks, we verify that clusters and terms indeed complement each other, enabling HI2 to achieve lossless retrieval quality with competitive efficiency across a variety of index settings.
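The idea described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the function names (`build_index`, `search`), the fixed centroids, the hand-picked document terms, and the plain inner-product scoring are all assumptions made for illustration, standing in for the learned cluster selector, term selector, and codec described above. The sketch only shows how term-based inverted lists can recover a relevant document that the probed cluster misses.

```python
from collections import defaultdict


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def build_index(doc_embs, doc_terms, centroids):
    """Build cluster-based and term-based inverted lists (toy version)."""
    cluster_lists = defaultdict(list)
    term_lists = defaultdict(list)
    for doc_id, emb in enumerate(doc_embs):
        # Assumption: each document goes to its single closest centroid.
        best = max(range(len(centroids)), key=lambda c: dot(emb, centroids[c]))
        cluster_lists[best].append(doc_id)
        # Assumption: the salient terms of each document are given up front.
        for term in doc_terms[doc_id]:
            term_lists[term].append(doc_id)
    return cluster_lists, term_lists


def search(query_emb, query_terms, doc_embs, centroids,
           cluster_lists, term_lists, n_probe=1, top_k=2):
    """Probe the nearest clusters and the query terms, then rank candidates."""
    ranked = sorted(range(len(centroids)),
                    key=lambda c: dot(query_emb, centroids[c]), reverse=True)
    candidates = set()
    for c in ranked[:n_probe]:
        candidates.update(cluster_lists.get(c, []))
    for term in query_terms:
        candidates.update(term_lists.get(term, []))
    # Only the candidates are scored, never the whole collection.
    scored = sorted(candidates,
                    key=lambda d: dot(query_emb, doc_embs[d]), reverse=True)
    return scored[:top_k]


# Hypothetical toy data: document 2 lies near the cluster boundary.
centroids = [[1.0, 0.0], [0.0, 1.0]]
doc_embs = [[0.9, 0.1], [0.8, 0.3], [0.6, 0.7], [0.1, 0.9]]
doc_terms = [{"index"}, {"cluster"}, {"hybrid", "index"}, {"term"}]
cluster_lists, term_lists = build_index(doc_embs, doc_terms, centroids)

query_emb = [0.55, 0.5]
cluster_only = search(query_emb, set(), doc_embs, centroids,
                      cluster_lists, term_lists)
hybrid = search(query_emb, {"hybrid"}, doc_embs, centroids,
                cluster_lists, term_lists)
print(cluster_only, hybrid)
```

In this toy setup, document 2 has the highest inner product with the query but is assigned to the cluster that is not probed, so the cluster-only search cannot return it; adding the term list for "hybrid" brings it back into the candidate set.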
Anthology ID:
2023.emnlp-main.116
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
1877–1888
URL:
https://aclanthology.org/2023.emnlp-main.116
DOI:
10.18653/v1/2023.emnlp-main.116
Cite (ACL):
Peitian Zhang, Zheng Liu, Shitao Xiao, Zhicheng Dou, and Jing Yao. 2023. Hybrid Inverted Index Is a Robust Accelerator for Dense Retrieval. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1877–1888, Singapore. Association for Computational Linguistics.
Cite (Informal):
Hybrid Inverted Index Is a Robust Accelerator for Dense Retrieval (Zhang et al., EMNLP 2023)
PDF:
https://preview.aclanthology.org/naacl24-info/2023.emnlp-main.116.pdf
Video:
https://preview.aclanthology.org/naacl24-info/2023.emnlp-main.116.mp4