Nepali Encoder Transformers: An Analysis of Auto Encoding Transformer Language Models for Nepali Text Classification
Utsav Maskey, Manish Bhatta, Shiva Bhatt, Sanket Dhungel, Bal Krishna Bal
Abstract
Language model pre-training has significantly impacted NLP and yielded performance gains on many NLP-related tasks, but comparative studies of different approaches for many low-resource languages are still largely missing. This paper investigates appropriate methods for pretraining a Transformer-based model for the Nepali language, focusing on the language-specific aspects that need to be considered for modeling. Although some language models have been trained for Nepali, the existing work is far from sufficient. We train three distinct Transformer-based masked language models for Nepali text sequences: distilbert-base (Sanh et al., 2019) for its efficiency and small size, deberta-base (P. He et al., 2020) for its ability to model dependencies between nearby token pairs, and XLM-RoBERTa (Conneau et al., 2020) for its ability to handle multilingual downstream tasks. We evaluate and compare these models with other Transformer-based models on a downstream classification task, with the aim of suggesting an effective strategy for training low-resource language models and fine-tuning them.
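The abstract describes a two-stage recipe: masked-language-model pretraining on raw Nepali text, followed by fine-tuning for text classification. The sketch below illustrates that general pipeline with the Hugging Face Transformers library; it is not the authors' code, and the base checkpoint, corpus file, and label count are placeholder assumptions.

```python
# Minimal sketch (assumed setup, not the paper's implementation):
# continued masked-LM pretraining on raw Nepali text, then reuse of the
# resulting encoder for sequence classification.
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    AutoModelForSequenceClassification,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

model_name = "distilbert-base-multilingual-cased"  # placeholder base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)

# --- Stage 1: masked language model pretraining on raw Nepali text ---
raw = load_dataset("text", data_files={"train": "nepali_corpus.txt"})  # placeholder corpus
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

mlm_model = AutoModelForMaskedLM.from_pretrained(model_name)
Trainer(
    model=mlm_model,
    args=TrainingArguments(output_dir="nepali-mlm", num_train_epochs=1),
    train_dataset=tokenized["train"],
    data_collator=collator,
).train()
mlm_model.save_pretrained("nepali-mlm")
tokenizer.save_pretrained("nepali-mlm")

# --- Stage 2: fine-tune the pretrained encoder for Nepali text classification ---
clf_model = AutoModelForSequenceClassification.from_pretrained(
    "nepali-mlm", num_labels=4  # placeholder label count (e.g. news categories)
)
# A labeled classification dataset would be tokenized the same way and passed
# to another Trainer instance for fine-tuning and evaluation.
```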
- Anthology ID: 2022.sigul-1.14
- Volume: Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages
- Month: June
- Year: 2022
- Address: Marseille, France
- Editors: Maite Melero, Sakriani Sakti, Claudia Soria
- Venue: SIGUL
- SIG: SIGUL
- Publisher: European Language Resources Association
- Pages: 106–111
- URL: https://aclanthology.org/2022.sigul-1.14
- Cite (ACL): Utsav Maskey, Manish Bhatta, Shiva Bhatt, Sanket Dhungel, and Bal Krishna Bal. 2022. Nepali Encoder Transformers: An Analysis of Auto Encoding Transformer Language Models for Nepali Text Classification. In Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages, pages 106–111, Marseille, France. European Language Resources Association.
- Cite (Informal): Nepali Encoder Transformers: An Analysis of Auto Encoding Transformer Language Models for Nepali Text Classification (Maskey et al., SIGUL 2022)
- PDF: https://preview.aclanthology.org/naacl24-info/2022.sigul-1.14.pdf