Hildoc: Leveraging Hilbert Curve Representation for Accurate and Efficient Document Retrieval

Muhammad AL-Qurishi, Zhaozhi Qian, Faroq AL-Tam, Riad Souissi


Abstract
Document retrieval is a critical challenge in information retrieval systems, where the goal is to efficiently retrieve relevant documents in response to a given query. Dense retrieval methods, which utilize vector embeddings to represent semantic information, require effective indexing to ensure fast and accurate retrieval. Existing methods, such as MEVI, have attempted to address this by using hierarchical K-Means for clustering, but they often face limitations in computational efficiency and retrieval accuracy. In this paper, we introduce the Hildoc Index, a novel document indexing approach that leverages the Hilbert Curve to map document embeddings onto a one-dimensional space. This innovative representation facilitates efficient clustering using a 1D quantile-based algorithm, ensuring uniform partition sizes and preserving the inherent structure of the data. As a result, Hildoc Index not only reduces training complexity but also enhances retrieval accuracy and speed during inference. Our method can be seamlessly integrated into both dense retrieval systems and hybrid ensemble systems. Through comprehensive experiments on standard benchmarks like MSMARCO Passage and Natural Questions, we demonstrate that the Hildoc Index significantly outperforms the current state-of-the-art MEVI in terms of both retrieval speed and recall. These results underscore the Hildoc Index as a solution for fast and accurate dense document retrieval.
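The two ideas named in the abstract, mapping embeddings to one dimension with a Hilbert curve and then cutting that line into equal-sized partitions by quantile, can be illustrated with a small sketch. It is not the authors' implementation. It assumes the embeddings have already been reduced to two dimensions and snapped to a grid, which the abstract does not state, and the names `hilbert_index_2d`, `quantize`, `quantile_partition`, and `build_index` are hypothetical.

```python
from typing import List, Sequence, Tuple


def hilbert_index_2d(order: int, x: int, y: int) -> int:
    """Position of grid cell (x, y) along a 2D Hilbert curve on a 2**order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def quantize(points: Sequence[Tuple[float, float]], order: int) -> List[Tuple[int, int]]:
    """Min-max scale each coordinate onto the integer grid [0, 2**order - 1].

    Assumption for this sketch: inputs are 2D. Real embeddings are much
    higher-dimensional and would need a higher-dimensional curve or a projection.
    """
    side = (1 << order) - 1
    cols = list(zip(*points))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return [
        (int(round((p[0] - lows[0]) / spans[0] * side)),
         int(round((p[1] - lows[1]) / spans[1] * side)))
        for p in points
    ]


def quantile_partition(keys: Sequence[int], k: int) -> List[int]:
    """Assign each key to one of k partitions by its rank, so sizes differ by at most 1."""
    n = len(keys)
    ranked = sorted(range(n), key=lambda i: keys[i])
    labels = [0] * n
    for rank, i in enumerate(ranked):
        labels[i] = rank * k // n
    return labels


def build_index(points: Sequence[Tuple[float, float]], k: int, order: int = 8) -> List[int]:
    """Hilbert-sort the points, then split the 1D ordering into k quantile buckets."""
    cells = quantize(points, order)
    keys = [hilbert_index_2d(order, x, y) for x, y in cells]
    return quantile_partition(keys, k)
```

Because partitions are cut by rank rather than by distance to a centroid, their sizes are balanced by construction, which is the property the abstract attributes to the quantile-based step. How well neighbours in the original space stay together depends on the curve and on the dimensionality, so this toy version says nothing about the recall figures reported in the paper.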
Anthology ID:
2025.ijcnlp-long.101
Volume:
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Month:
December
Year:
2025
Address:
Mumbai, India
Editors:
Kentaro Inui, Sakriani Sakti, Haofen Wang, Derek F. Wong, Pushpak Bhattacharyya, Biplab Banerjee, Asif Ekbal, Tanmoy Chakraborty, Dhirendra Pratap Singh
Venues:
IJCNLP | AACL
Publisher:
The Asian Federation of Natural Language Processing and The Association for Computational Linguistics
Pages:
1863–1876
URL:
https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.ijcnlp-long.101/
Cite (ACL):
Muhammad AL-Qurishi, Zhaozhi Qian, Faroq AL-Tam, and Riad Souissi. 2025. Hildoc: Leveraging Hilbert Curve Representation for Accurate and Efficient Document Retrieval. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pages 1863–1876, Mumbai, India. The Asian Federation of Natural Language Processing and The Association for Computational Linguistics.
Cite (Informal):
Hildoc: Leveraging Hilbert Curve Representation for Accurate and Efficient Document Retrieval (AL-Qurishi et al., IJCNLP-AACL 2025)
PDF:
https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.ijcnlp-long.101.pdf