Habeeb Shopeju


2025

Improving BGE-M3 Multilingual Dense Embeddings for Nigerian Low Resource Languages
Abdulmatin Omotoso | Habeeb Shopeju | Adejumobi Monjolaoluwa Joshua | Shiloh Oni
Proceedings of the 9th Widening NLP Workshop

Multilingual dense embedding models such as Multilingual E5, LaBSE, and BGE-M3 have shown promising results on diverse benchmarks for information retrieval in low-resource languages, but their performance on low-resource languages still lags behind that on high-resource languages. This work improves the performance of BGE-M3 through contrastive fine-tuning; the model was selected because of its superior performance over other multilingual embedding models across the MIRACL, MTEB, and SEB benchmarks. To fine-tune this model, we curated a comprehensive dataset comprising Yorùbá (32.9k rows), Igbo (18k rows), and Hausa (85k rows), drawn mainly from news sources. We further augmented our multilingual dataset with English queries mapped to the Yorùbá, Igbo, and Hausa documents, enabling cross-lingual semantic training. We evaluate in two settings: the Wura test set and the MIRACL benchmark. On Wura, the fine-tuned BGE-M3 raises mean reciprocal rank (MRR) to 0.9201 for Yorùbá, 0.8638 for Igbo, 0.9230 for Hausa, and 0.8617 for English queries matched to local documents, surpassing the BGE-M3 baselines of 0.7846, 0.7566, 0.8575, and 0.7377, respectively. On the MIRACL Yorùbá subset, the fine-tuned model attains 0.5996 MRR, slightly surpassing base BGE-M3 (0.5952) and outperforming ML-E5-large (0.5632) and LaBSE (0.4468).
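
The paper itself does not include code; the snippet below is only a minimal sketch of contrastive fine-tuning on query–document pairs with in-batch negatives, using the sentence-transformers library and the public BAAI/bge-m3 checkpoint. The pair file name, batch size, epoch count, and output path are illustrative assumptions, not details taken from the paper.

```python
import csv

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Load the base BGE-M3 dense embedding model from Hugging Face.
model = SentenceTransformer("BAAI/bge-m3")

# "train_pairs.tsv" is an assumed file of tab-separated (query, document)
# pairs: queries in Yorùbá, Igbo, Hausa, or English, each matched to a
# document in one of the three Nigerian languages.
with open("train_pairs.tsv", encoding="utf-8") as f:
    train_examples = [
        InputExample(texts=[query, document])
        for query, document in csv.reader(f, delimiter="\t")
    ]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# In-batch negatives: for each query, the other documents in the batch act
# as negatives (InfoNCE-style contrastive objective).
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
    output_path="bge-m3-finetuned",
)
```

For the reported metric, MRR averages the reciprocal rank of the first relevant document over all queries, i.e. MRR = (1/|Q|) * sum over queries of 1/rank_i, so a value of 0.92 roughly corresponds to the correct document appearing at or near the top of the ranking for most queries.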