Longtriever: a Pre-trained Long Text Encoder for Dense Document Retrieval
Junhan Yang, Zheng Liu, Chaozhuo Li, Guangzhong Sun, Xing Xie
Abstract
Pre-trained language models (PLMs) have achieved the preeminent position in dense retrieval due to their powerful capacity for modeling intrinsic semantics. However, most existing PLM-based retrieval models incur substantial computational costs and are infeasible for processing long documents. In this paper, a novel retrieval model, Longtriever, is proposed to address three core challenges of long document retrieval: substantial computational cost, incomprehensive document understanding, and scarce annotations. Longtriever splits long documents into short blocks and then efficiently models the local semantics within a block and the global context semantics across blocks in a tightly coupled manner. A pre-training phase is further proposed to empower Longtriever to achieve a better understanding of the underlying semantic correlations. Experimental results on two popular benchmark datasets demonstrate the superiority of our proposal.
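As a rough illustration of the block-wise encoding idea described in the abstract, here is a minimal PyTorch sketch, not the paper's actual implementation: a long document is split into fixed-length blocks, local self-attention runs within each block, and per-block summary tokens exchange information across blocks. All names, shapes, and design details here (`BlockEncoderLayer`, `block_len`, using token 0 as a block summary) are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch only: hypothetical names/shapes, not the paper's code.
import torch
import torch.nn as nn


class BlockEncoderLayer(nn.Module):
    """One coupled layer: local attention within each block, then global
    attention across per-block summary tokens (assumed to be token 0)."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_local = nn.LayerNorm(dim)
        self.norm_global = nn.LayerNorm(dim)

    def forward(self, blocks: torch.Tensor) -> torch.Tensor:
        # blocks: (n_blocks, block_len, dim); each block attends only to
        # its own tokens, so cost grows linearly with document length.
        local_out, _ = self.local_attn(blocks, blocks, blocks)
        blocks = self.norm_local(blocks + local_out)

        # Global step: the summary tokens of all blocks attend to each
        # other, propagating document-level context between blocks.
        summaries = blocks[:, 0, :].unsqueeze(0)          # (1, n_blocks, dim)
        global_out, _ = self.global_attn(summaries, summaries, summaries)
        summaries = self.norm_global(summaries + global_out).squeeze(0)

        # Write the updated summaries back into position 0 of each block.
        return torch.cat([summaries.unsqueeze(1), blocks[:, 1:, :]], dim=1)


if __name__ == "__main__":
    # Example: a 2048-token document as 8 blocks of 256 (embedded) tokens.
    n_blocks, block_len, dim = 8, 256, 256
    doc = torch.randn(n_blocks, block_len, dim)
    layer = BlockEncoderLayer(dim)
    out = layer(doc)
    doc_embedding = out[:, 0, :].mean(dim=0)  # pool summaries -> (dim,)
    print(doc_embedding.shape)                # torch.Size([256])
```

Confining cross-block communication to the summary tokens is what keeps the attention cost from growing quadratically with document length, which matches the efficiency motivation stated in the abstract.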
- Anthology ID: 2023.emnlp-main.223
- Volume: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
- Month: December
- Year: 2023
- Address: Singapore
- Editors: Houda Bouamor, Juan Pino, Kalika Bali
- Venue: EMNLP
- Publisher: Association for Computational Linguistics
- Pages: 3655–3665
- URL: https://aclanthology.org/2023.emnlp-main.223
- DOI: 10.18653/v1/2023.emnlp-main.223
- Cite (ACL): Junhan Yang, Zheng Liu, Chaozhuo Li, Guangzhong Sun, and Xing Xie. 2023. Longtriever: a Pre-trained Long Text Encoder for Dense Document Retrieval. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3655–3665, Singapore. Association for Computational Linguistics.
- Cite (Informal): Longtriever: a Pre-trained Long Text Encoder for Dense Document Retrieval (Yang et al., EMNLP 2023)
- PDF: https://preview.aclanthology.org/landing_page/2023.emnlp-main.223.pdf