LexGen: Domain-aware Multilingual Lexicon Generation

Ayush Maheshwari, Atul Kumar Singh, N J Karthika, Krishnakant Bhatt, Preethi Jyothi, Ganesh Ramakrishnan


Abstract
Lexicon or dictionary generation across domains has the potential for societal impact, as it can potentially enhance information accessibility for a diverse user base while preserving language identity. Prior work in the field primarily focuses on bilingual lexical induction, which deals with word alignments using mapping-based or corpora-based approaches. However, these approaches do not cater to domain-specific lexicon generation that consists of domain-specific terminology. This task becomes particularly important in specialized medical, engineering, and other technical domains, owing to the highly infrequent usage of the terms and scarcity of data involving domain-specific terms especially for low-resource languages. We propose a new model to generate dictionary words for 6 Indian languages in the multi-domain setting. Our model consists of domain-specific and domain-generic layers that encode information, and these layers are invoked via a learnable routing technique. We also release a new benchmark dataset consisting of >75K translation pairs across 6 Indian languages spanning 8 diverse domains. We conduct both zero-shot and few-shot experiments across multiple domains to show the efficacy of our proposed model in generalizing to unseen domains and unseen languages. Additionally, we also perform a human post-hoc evaluation on unseen languages. The source code and dataset is present at https://github.com/Atulkmrsingh/lexgen.
Anthology ID:
2025.acl-long.365
Volume:
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
ACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
7364–7375
Language:
URL:
https://preview.aclanthology.org/landing_page/2025.acl-long.365/
DOI:
Bibkey:
Cite (ACL):
Ayush Maheshwari, Atul Kumar Singh, N J Karthika, Krishnakant Bhatt, Preethi Jyothi, and Ganesh Ramakrishnan. 2025. LexGen: Domain-aware Multilingual Lexicon Generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7364–7375, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
LexGen: Domain-aware Multilingual Lexicon Generation (Maheshwari et al., ACL 2025)
Copy Citation:
PDF:
https://preview.aclanthology.org/landing_page/2025.acl-long.365.pdf