Sriram Satkirti Purighella
2025
SenWiCh: Sense-Annotation of Low-Resource Languages for WiC using Hybrid Methods
Roksana Goworek | Harpal Singh Karlcut | Hamza Shezad | Nijaguna Darshana | Abhishek Mane | Syam Bondada | Raghav Sikka | Ulvi Mammadov | Rauf Allahverdiyev | Sriram Satkirti Purighella | Paridhi Gupta | Muhinyia Ndegwa | Bao Khanh Tran | Haim Dubossarsky
Proceedings of the 7th Workshop on Research in Computational Linguistic Typology and Multilingual NLP
This paper addresses the critical need for high-quality evaluation datasets in low-resource languages to advance cross-lingual transfer. While cross-lingual transfer offers a key strategy for leveraging multilingual pretraining to extend language technologies to understudied and typologically diverse languages, its effectiveness depends on the availability of high-quality, suitable benchmarks. We release new sense-annotated datasets of sentences containing polysemous words, spanning nine low-resource languages across diverse language families and scripts. To facilitate dataset creation, we present a semi-automatic annotation method and show that it is beneficial. We demonstrate the utility of the datasets through experiments in the Word-in-Context (WiC) format that evaluate transfer to these low-resource languages. The results highlight the importance of targeted dataset creation and evaluation for effective polysemy disambiguation in low-resource settings and transfer studies. The released datasets and code aim to support further research into fair, robust, and truly multilingual NLP.