On Evaluation of Bangla Word Analogies

Mousumi Akter, Souvika Sarkar, Shubhra Kanti Karmaker Santu


Abstract
This paper presents a benchmark dataset of Bangla word analogies for evaluating the quality of existing Bangla word embeddings. Although Bangla is the seventh most widely spoken language in the world, it remains a low-resource language, and popular NLP models often struggle to perform well on Bangla datasets. A robust evaluation set, which is currently missing, is therefore crucial for benchmarking Bangla word embeddings and guiding future research on improving them. To address this issue, we introduce a new evaluation set of 16,678 unique word analogies in Bangla, as well as a translated and curated Bangla version of the original Mikolov dataset (10,594 samples). Our experiments with different state-of-the-art embedding models reveal that current Bangla word embeddings struggle to achieve high accuracy on both datasets, demonstrating a significant gap in multilingual NLP research.
Anthology ID:
2023.emnlp-main.811
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
13121–13127
URL:
https://aclanthology.org/2023.emnlp-main.811
DOI:
10.18653/v1/2023.emnlp-main.811
Cite (ACL):
Mousumi Akter, Souvika Sarkar, and Shubhra Kanti Karmaker Santu. 2023. On Evaluation of Bangla Word Analogies. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13121–13127, Singapore. Association for Computational Linguistics.
Cite (Informal):
On Evaluation of Bangla Word Analogies (Akter et al., EMNLP 2023)
PDF:
https://preview.aclanthology.org/naacl24-info/2023.emnlp-main.811.pdf
Video:
https://preview.aclanthology.org/naacl24-info/2023.emnlp-main.811.mp4