Efficient Multilingual Text Classification for Indian Languages

Salil Aggarwal, Sourav Kumar, Radhika Mamidi


Abstract
India is one of the richest language hubs in the world and is highly diverse and multilingual. Yet apart from a few Indian languages, most are still considered resource-poor. Most NLP techniques require either linguistic knowledge that only experts and native speakers of a language can provide, or large amounts of labelled data, which is expensive to produce, so text classification remains challenging for most Indian languages. The main objective of this paper is to examine how a multilingual setting can benefit from the lexical similarity found among Indian languages. Can a classification model trained on one Indian language be reused for other Indian languages? We perform zero-shot text classification by exploiting lexical similarity and observe that our model performs best when the vocabulary overlap between the language datasets is highest. Our experiments also confirm that a single multilingual model trained by exploiting language relatedness outperforms the baselines by significant margins.
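The abstract ties zero-shot performance to the vocabulary overlap between language datasets, so a rough way to quantify that overlap may be useful. The sketch below is an illustration only, not the authors' procedure: the function names, the whitespace tokenization, and the choice of measure (the share of target-vocabulary types that also occur in the source vocabulary) are all assumptions, and the abstract does not say how overlap was measured in the paper.

```python
def vocabulary(texts):
    """Collect the set of whitespace-separated tokens in a list of sentences.

    Whitespace tokenization is an assumption made for this sketch.
    """
    return {token for text in texts for token in text.split()}


def vocabulary_overlap(source_texts, target_texts):
    """Return the fraction of target-vocabulary types also found in the source.

    This measure is a hypothetical choice for illustration; the value lies
    between 0.0 (no shared types) and 1.0 (every target type is shared).
    """
    source_vocab = vocabulary(source_texts)
    target_vocab = vocabulary(target_texts)
    if not target_vocab:
        # An empty target dataset has no types to share.
        return 0.0
    return len(source_vocab & target_vocab) / len(target_vocab)


if __name__ == "__main__":
    # Placeholder tokens standing in for two language datasets.
    source = ["w1 w2 w3", "w2 w4"]
    target = ["w2 w3 w5 w6"]
    print(vocabulary_overlap(source, target))
```

Under these assumptions, a higher value for a source and target pair would suggest that a classifier trained on the source dataset has seen more of the target's surface forms. The measure is asymmetric, so swapping source and target can change the result.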
Anthology ID:
2021.ranlp-1.3
Volume:
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
Month:
September
Year:
2021
Address:
Held Online
Venue:
RANLP
Publisher:
INCOMA Ltd.
Pages:
19–25
URL:
https://aclanthology.org/2021.ranlp-1.3
Cite (ACL):
Salil Aggarwal, Sourav Kumar, and Radhika Mamidi. 2021. Efficient Multilingual Text Classification for Indian Languages. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), pages 19–25, Held Online. INCOMA Ltd.
Cite (Informal):
Efficient Multilingual Text Classification for Indian Languages (Aggarwal et al., RANLP 2021)
PDF:
https://preview.aclanthology.org/auto-file-uploads/2021.ranlp-1.3.pdf