LexGen: Domain-aware Multilingual Lexicon Generation
Ayush Maheshwari, Atul Kumar Singh, N J Karthika, Krishnakant Bhatt, Preethi Jyothi, Ganesh Ramakrishnan
Abstract
Lexicon or dictionary generation across domains has the potential for societal impact, as it can potentially enhance information accessibility for a diverse user base while preserving language identity. Prior work in the field primarily focuses on bilingual lexical induction, which deals with word alignments using mapping-based or corpora-based approaches. However, these approaches do not cater to domain-specific lexicon generation that consists of domain-specific terminology. This task becomes particularly important in specialized medical, engineering, and other technical domains, owing to the highly infrequent usage of the terms and scarcity of data involving domain-specific terms especially for low-resource languages. We propose a new model to generate dictionary words for 6 Indian languages in the multi-domain setting. Our model consists of domain-specific and domain-generic layers that encode information, and these layers are invoked via a learnable routing technique. We also release a new benchmark dataset consisting of >75K translation pairs across 6 Indian languages spanning 8 diverse domains. We conduct both zero-shot and few-shot experiments across multiple domains to show the efficacy of our proposed model in generalizing to unseen domains and unseen languages. Additionally, we also perform a human post-hoc evaluation on unseen languages. The source code and dataset is present at https://github.com/Atulkmrsingh/lexgen.- Anthology ID:
- 2025.acl-long.365
- Volume:
- Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month:
- July
- Year:
- 2025
- Address:
- Vienna, Austria
- Editors:
- Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
- Venue:
- ACL
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 7364–7375
- Language:
- URL:
- https://preview.aclanthology.org/landing_page/2025.acl-long.365/
- DOI:
- Cite (ACL):
- Ayush Maheshwari, Atul Kumar Singh, N J Karthika, Krishnakant Bhatt, Preethi Jyothi, and Ganesh Ramakrishnan. 2025. LexGen: Domain-aware Multilingual Lexicon Generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7364–7375, Vienna, Austria. Association for Computational Linguistics.
- Cite (Informal):
- LexGen: Domain-aware Multilingual Lexicon Generation (Maheshwari et al., ACL 2025)
- PDF:
- https://preview.aclanthology.org/landing_page/2025.acl-long.365.pdf