LEMUR: Robust Fine-Tuning for Multilingual Embedding Models for Retrieval
Narges Baba Ahmadi, Jan Strich, Martin Semmann, Chris Biemann
Abstract
Large language models (LLMs) are increasingly used to access legal information. Yet, their deployment in multilingual legal settings is constrained by unreliable retrieval and the lack of domain-adapted, open-embedding models. In particular, existing multilingual legal corpora are not designed for semantic retrieval, and PDF-based legislative sources introduce substantial noise due to imperfect text extraction. To address these challenges, we introduce LEMUR, a large-scale multilingual corpus of EU environmental legislation constructed from 24,953 official EUR-Lex PDF documents covering 25 languages. We further propose the Lexical Content Score (LCS), a language-agnostic metric that quantifies the fidelity of PDF-to-text conversion by measuring lexical consistency against authoritative HTML versions. Building on LEMUR, we fine-tune three state-of-the-art multilingual embedding models using contrastive objectives in both monolingual and bilingual settings, reflecting realistic legal-retrieval scenarios. Experiments across low- and high-resource languages demonstrate that legal-domain fine-tuning consistently improves Top-k retrieval accuracy relative to strong baselines, with particularly pronounced gains for low-resource languages. Cross-lingual evaluations show that these improvements transfer to unseen languages, indicating that fine-tuning primarily enhances language-independent, content-level legal representations rather than language-specific cues. We publish code[GitHub Repository] and data[Hugging Face Dataset].- Anthology ID:
- 2026.eacl-srw.18
- Volume:
- Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 4: Student Research Workshop)
- Month:
- March
- Year:
- 2026
- Address:
- Rabat, Morocco
- Editors:
- Selene Baez Santamaria, Sai Ashish Somayajula, Atsuki Yamaguchi
- Venue:
- EACL
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 248–265
- Language:
- URL:
- https://preview.aclanthology.org/ingest-eacl/2026.eacl-srw.18/
- DOI:
- Cite (ACL):
- Narges Baba Ahmadi, Jan Strich, Martin Semmann, and Chris Biemann. 2026. LEMUR: Robust Fine-Tuning for Multilingual Embedding Models for Retrieval. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 4: Student Research Workshop), pages 248–265, Rabat, Morocco. Association for Computational Linguistics.
- Cite (Informal):
- LEMUR: Robust Fine-Tuning for Multilingual Embedding Models for Retrieval (Ahmadi et al., EACL 2026)
- PDF:
- https://preview.aclanthology.org/ingest-eacl/2026.eacl-srw.18.pdf