Evaluating Performance of Pre-trained Word Embeddings on Assamese, a Low-resource Language

Dhrubajyoti Pathak, Sukumar Nandi, Priyankoo Sarmah


Abstract
Word embeddings and language models are the building blocks of modern deep neural network-based natural language processing. They have been extensively explored for high-resource languages and provide state-of-the-art (SOTA) performance on a wide range of downstream tasks. However, such word embeddings remain underexplored for languages with limited resources, such as Assamese, and there has been little work evaluating their performance on downstream tasks in low-resource settings. In this research, we survey the current state of Assamese pre-trained word embeddings and evaluate their performance on sequence labeling tasks, namely part-of-speech tagging and named entity recognition. To assess the effectiveness of the embeddings, experiments are conducted using both individual and ensemble word embedding approaches; an ensemble of three word embeddings outperforms the others. The paper describes the outcomes of these investigations. The results of this comparative performance evaluation may assist researchers in choosing an Assamese pre-trained word embedding for downstream tasks.
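One common way to ensemble several pre-trained word embeddings for sequence labeling is to concatenate each token's vectors before feeding them to the tagger. The paper does not specify its exact combination method, so the following is only a minimal illustrative sketch: the three dictionaries stand in for three hypothetical pre-trained Assamese embedding models, and the tiny 2-dimensional vectors are made-up values.

```python
import numpy as np

# Hypothetical lookup tables standing in for three pre-trained embedding
# models; in practice these would be loaded from files (e.g. with gensim)
# and have hundreds of dimensions.
emb_a = {"ভাষা": np.array([0.1, 0.2]), "অসম": np.array([0.3, 0.4])}
emb_b = {"ভাষা": np.array([0.5, 0.6]), "অসম": np.array([0.7, 0.8])}
emb_c = {"ভাষা": np.array([0.9, 1.0]), "অসম": np.array([1.1, 1.2])}

def ensemble_vector(token, embeddings, dim=2):
    """Concatenate the token's vectors from each embedding model.

    Tokens missing from a model fall back to a zero vector of that
    model's dimensionality, so the output size is always len(embeddings) * dim.
    """
    parts = [e.get(token, np.zeros(dim)) for e in embeddings]
    return np.concatenate(parts)

# A sentence is then represented as one concatenated vector per token,
# ready for a downstream POS or NER tagger.
sentence = ["অসম", "ভাষা"]
features = np.stack([ensemble_vector(t, [emb_a, emb_b, emb_c]) for t in sentence])
```

The concatenated representation lets the downstream tagger weight each source embedding itself; averaging the vectors instead would keep the input dimensionality fixed but requires all models to share the same dimension.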
Anthology ID:
2024.lrec-main.568
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
6418–6425
URL:
https://aclanthology.org/2024.lrec-main.568
Cite (ACL):
Dhrubajyoti Pathak, Sukumar Nandi, and Priyankoo Sarmah. 2024. Evaluating Performance of Pre-trained Word Embeddings on Assamese, a Low-resource Language. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 6418–6425, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Evaluating Performance of Pre-trained Word Embeddings on Assamese, a Low-resource Language (Pathak et al., LREC-COLING 2024)
PDF:
https://preview.aclanthology.org/add_acl24_videos/2024.lrec-main.568.pdf