Boosting Data Utilization for Multilingual Dense Retrieval

Chao Huang, Fengran Mo, Yufeng Chen, Changhao Guan, Zhenrui Yue, Xinyu Wang, Jinan Xu, Kaiyu Huang


Abstract
Multilingual dense retrieval aims to retrieve relevant documents across different languages with a unified retriever model. The challenge lies in aligning the representations of different languages in a shared vector space. The common practice is to fine-tune the dense retriever via contrastive learning, whose effectiveness relies heavily on the quality of the negative samples and the efficacy of the mini-batch data. Unlike existing studies that focus on developing sophisticated model architectures, we propose a method to boost data utilization for multilingual dense retrieval by obtaining high-quality hard negative samples and effective mini-batch data. Extensive experimental results on MIRACL, a multilingual retrieval benchmark covering 16 languages, demonstrate the effectiveness of our method, which outperforms several strong existing baselines.
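For readers unfamiliar with the training objective the abstract refers to, below is a minimal sketch of contrastive fine-tuning with in-batch positives and mined hard negatives (an InfoNCE-style loss). The function name, temperature value, and batch layout are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Minimal sketch of contrastive fine-tuning with in-batch and hard negatives.
# Names and hyperparameters here are illustrative assumptions, not the
# authors' implementation.
import torch
import torch.nn.functional as F

def contrastive_loss(q_emb, pos_emb, hard_neg_emb, temperature=0.05):
    """InfoNCE loss over in-batch positives plus one mined hard negative per query.

    q_emb:        (B, D) query embeddings
    pos_emb:      (B, D) embeddings of the relevant (positive) documents
    hard_neg_emb: (B, D) embeddings of one mined hard negative per query
    """
    q = F.normalize(q_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    n = F.normalize(hard_neg_emb, dim=-1)

    # Similarities to every positive in the batch (in-batch negatives) ...
    sim_pos = q @ p.T                      # (B, B)
    # ... plus similarities to the mined hard negatives.
    sim_neg = q @ n.T                      # (B, B)
    logits = torch.cat([sim_pos, sim_neg], dim=1) / temperature

    # The i-th query's true positive sits on the diagonal of sim_pos.
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)

# Example usage with random tensors standing in for encoder outputs.
B, D = 8, 768
loss = contrastive_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
```

In this formulation, the quality of `hard_neg_emb` and the composition of the mini-batch directly shape the loss signal, which is why the abstract emphasizes obtaining high-quality hard negatives and effective mini-batch data.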
Anthology ID:
2025.emnlp-main.624
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
12373–12389
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.624/
Cite (ACL):
Chao Huang, Fengran Mo, Yufeng Chen, Changhao Guan, Zhenrui Yue, Xinyu Wang, Jinan Xu, and Kaiyu Huang. 2025. Boosting Data Utilization for Multilingual Dense Retrieval. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12373–12389, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Boosting Data Utilization for Multilingual Dense Retrieval (Huang et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.624.pdf
Checklist:
2025.emnlp-main.624.checklist.pdf