IndiSocialFT: Multilingual Word Representation for Indian languages in code-mixed environment

Saurabh Kumar, Ranbir Sanasam, Sukumar Nandi


Abstract
The growing number of Indian language users on the internet calls for the development of Indian language technologies. In response, this paper presents a generalized word representation that covers diverse text characteristics: native scripts, transliterated text, multilingual and code-mixed text, and social media-specific attributes. We gather text from both social media and well-formed sources and use the FastText model to create the “IndiSocialFT” embedding. Using intrinsic and extrinsic evaluation, we compare IndiSocialFT with three popular pretrained embeddings trained on Indian languages. The proposed embedding surpasses the baselines in most cases and languages, which suggests it is suitable for a range of NLP applications.
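
As a rough illustration of why a FastText-style model may suit transliterated and code-mixed text, the sketch below extracts the character n-grams that FastText uses as subword units. It is a simplified sketch, not the authors' pipeline: the function names are hypothetical, the n-gram range of 3 to 6 is the commonly used FastText default rather than a setting confirmed by this paper, and the real model additionally hashes n-grams into buckets and keeps the whole word as its own unit. The two romanized spellings are illustrative examples only.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return FastText-style character n-grams for a word.

    The word is wrapped in '<' and '>' boundary markers, as in the
    FastText subword model. min_n and max_n are assumed defaults,
    not values taken from the paper.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the wrapped word
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def shared_ngrams(a, b, **kwargs):
    """Character n-grams that two spellings have in common."""
    return set(char_ngrams(a, **kwargs)) & set(char_ngrams(b, **kwargs))


# Two illustrative romanized spellings of the same Hindi word
print(sorted(shared_ngrams("nahi", "nahin")))
```

Because a word vector in this scheme is built from its n-gram vectors, spelling variants that share many n-grams end up with related representations even when one variant is rare in the training text.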
Anthology ID:
2023.findings-emnlp.252
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2023
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
3866–3871
URL:
https://aclanthology.org/2023.findings-emnlp.252
DOI:
10.18653/v1/2023.findings-emnlp.252
Cite (ACL):
Saurabh Kumar, Ranbir Sanasam, and Sukumar Nandi. 2023. IndiSocialFT: Multilingual Word Representation for Indian languages in code-mixed environment. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 3866–3871, Singapore. Association for Computational Linguistics.
Cite (Informal):
IndiSocialFT: Multilingual Word Representation for Indian languages in code-mixed environment (Kumar et al., Findings 2023)
PDF:
https://preview.aclanthology.org/nschneid-patch-4/2023.findings-emnlp.252.pdf