Longtriever: a Pre-trained Long Text Encoder for Dense Document Retrieval
Junhan Yang, Zheng Liu, Chaozhuo Li, Guangzhong Sun, Xing Xie
Abstract
Pre-trained language models (PLMs) have achieved the preeminent position in dense retrieval due to their powerful capacity for modeling intrinsic semantics. However, most existing PLM-based retrieval models incur substantial computational costs and are infeasible for processing long documents. In this paper, a novel retrieval model, Longtriever, is proposed to address three core challenges of long document retrieval: substantial computational cost, incomprehensive document understanding, and scarce annotations. Longtriever splits long documents into short blocks and then efficiently models the local semantics within a block and the global context semantics across blocks in a tightly coupled manner. A pre-training phase is further proposed to empower Longtriever to achieve a better understanding of the underlying semantic correlations. Experimental results on two popular benchmark datasets demonstrate the superiority of our proposal.
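As a rough illustration of the block-wise encoding idea described in the abstract, here is a minimal PyTorch sketch, not the paper's actual implementation: a long document is split into fixed-length blocks, local self-attention runs within each block, and per-block summary tokens exchange information across blocks. All names, shapes, and design details here (`BlockEncoderLayer`, `block_len`, using token 0 as a block summary) are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch only: hypothetical names/shapes, not the paper's code.
import torch
import torch.nn as nn


class BlockEncoderLayer(nn.Module):
    """One coupled layer: local attention within each block, then global
    attention across per-block summary tokens (assumed to be token 0)."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_local = nn.LayerNorm(dim)
        self.norm_global = nn.LayerNorm(dim)

    def forward(self, blocks: torch.Tensor) -> torch.Tensor:
        # blocks: (n_blocks, block_len, dim); each block attends only to
        # its own tokens, so cost grows linearly with document length.
        local_out, _ = self.local_attn(blocks, blocks, blocks)
        blocks = self.norm_local(blocks + local_out)

        # Global step: the summary tokens of all blocks attend to each
        # other, propagating document-level context between blocks.
        summaries = blocks[:, 0, :].unsqueeze(0)          # (1, n_blocks, dim)
        global_out, _ = self.global_attn(summaries, summaries, summaries)
        summaries = self.norm_global(summaries + global_out).squeeze(0)

        # Write the updated summaries back into position 0 of each block.
        return torch.cat([summaries.unsqueeze(1), blocks[:, 1:, :]], dim=1)


if __name__ == "__main__":
    # Example: a 2048-token document as 8 blocks of 256 (embedded) tokens.
    n_blocks, block_len, dim = 8, 256, 256
    doc = torch.randn(n_blocks, block_len, dim)
    layer = BlockEncoderLayer(dim)
    out = layer(doc)
    doc_embedding = out[:, 0, :].mean(dim=0)  # pool summaries -> (dim,)
    print(doc_embedding.shape)                # torch.Size([256])
```

Confining cross-block communication to the summary tokens is what keeps the attention cost from growing quadratically with document length, which matches the efficiency motivation stated in the abstract.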
- Anthology ID: 2023.emnlp-main.223
- Volume: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
- Month: December
- Year: 2023
- Address: Singapore
- Editors: Houda Bouamor, Juan Pino, Kalika Bali
- Venue: EMNLP
- Publisher: Association for Computational Linguistics
- Pages: 3655–3665
- URL: https://aclanthology.org/2023.emnlp-main.223
- DOI: 10.18653/v1/2023.emnlp-main.223
- Cite (ACL): Junhan Yang, Zheng Liu, Chaozhuo Li, Guangzhong Sun, and Xing Xie. 2023. Longtriever: a Pre-trained Long Text Encoder for Dense Document Retrieval. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3655–3665, Singapore. Association for Computational Linguistics.
- Cite (Informal): Longtriever: a Pre-trained Long Text Encoder for Dense Document Retrieval (Yang et al., EMNLP 2023)
- PDF: https://preview.aclanthology.org/landing_page/2023.emnlp-main.223.pdf