Sundeep Dawadi
2026
Nepali Lemmatization with Multilingual Transformers: Intrinsic and Extrinsic Evaluation in a Low-Resource Setting
Sunil Regmi | Sundeep Dawadi | Bal Krishna Bal
Proceedings of the Fifteenth Language Resources and Evaluation Conference
The Nepali language has a rich and complex morphology, yet existing lemmatization research focuses on traditional rule-based or trie-based approaches, which often fail on out-of-vocabulary or misspelled words. This paper investigates neural lemmatization for the under-resourced Nepali language using multilingual transformer models. We formulate lemmatization as a text-to-text generation problem and evaluate its impact on downstream tasks by fine-tuning mBART-large-50, mT5-base, and mT5-small. The models were trained on a dataset of 8,000 word-lemma pairs combining publicly available and human-annotated instances. Performance is evaluated using Character Error Rate (CER), accuracy, character-level Bilingual Evaluation Understudy (BLEU), and morphological coverage. The mT5-base model achieved the highest overall performance, reaching 96.1% accuracy and a 1.1% CER with a learning rate of 5 × 10⁻⁴, although it was slightly weaker on complex morphological variations. The mBART-large-50 model followed closely with 96.0% accuracy and 0.970 morphological coverage. To assess the efficacy of these models, we applied lemmatization to downstream tasks: in Hindi-Nepali cross-lingual alignment, performance improved significantly from 12.86% to 41.61% with the mBART model, and in information retrieval, Mean Average Precision (MAP)@1 with a binary index increased from 0.71 to 0.90, also with the mBART model. These results demonstrate that multilingual transformers effectively learn morphological transformations for low-resource languages through text-to-text generation.
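To make the text-to-text formulation concrete, the following is a minimal sketch of fine-tuning mT5-base on word-lemma pairs in a HuggingFace-style workflow. The toy Nepali pairs, batch size, epoch count, and output directory are illustrative assumptions, not the paper's exact setup; only the 5 × 10⁻⁴ learning rate follows the abstract.

    # Sketch: lemmatization as text-to-text generation with mT5-base.
    # Assumes the HuggingFace transformers/datasets libraries.
    from datasets import Dataset
    from transformers import (
        DataCollatorForSeq2Seq,
        MT5ForConditionalGeneration,
        MT5Tokenizer,
        Seq2SeqTrainer,
        Seq2SeqTrainingArguments,
    )

    tokenizer = MT5Tokenizer.from_pretrained("google/mt5-base")
    model = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")

    # Toy stand-in for the 8,000 annotated word-lemma pairs.
    pairs = Dataset.from_dict({"word": ["गर्छु", "गएको"],
                               "lemma": ["गर्नु", "जानु"]})

    def encode(batch):
        # Source is the inflected surface form; target is its lemma.
        enc = tokenizer(batch["word"], truncation=True, max_length=16)
        enc["labels"] = tokenizer(text_target=batch["lemma"],
                                  truncation=True, max_length=16)["input_ids"]
        return enc

    train = pairs.map(encode, batched=True, remove_columns=["word", "lemma"])

    trainer = Seq2SeqTrainer(
        model=model,
        args=Seq2SeqTrainingArguments(
            output_dir="mt5-nepali-lemma",   # illustrative path
            learning_rate=5e-4,              # as reported in the abstract
            per_device_train_batch_size=32,  # assumed, not from the paper
            num_train_epochs=5,              # assumed, not from the paper
        ),
        train_dataset=train,
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
    trainer.train()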
2025
Paramananda@NLU of Devanagari Script Languages 2025: Detection of Language, Hate Speech and Targets using FastText and BERT
Darwin Acharya | Sundeep Dawadi | Shivram Saud | Sunil Regmi
Proceedings of the First Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2025)
This paper presents a comparative analysis of FastText and BERT-based approaches for Natural Language Understanding (NLU) tasks in Devanagari script languages. We evaluate these models on three critical tasks: language identification, hate speech detection, and target identification, across five languages: Nepali, Marathi, Sanskrit, Bhojpuri, and Hindi. Our experiments, conducted on a raw tweet dataset from which only Devanagari-script text was extracted, demonstrate that while both models achieve exceptional performance in language identification (F1 scores > 0.99), they show varying effectiveness in hate speech detection and target identification. FastText with augmented data outperforms BERT in hate speech detection (F1 score: 0.8552 vs. 0.5763), while BERT shows superior performance in target identification (F1 score: 0.5785 vs. 0.4898). These findings contribute to the growing body of research on NLU for low-resource languages and provide insights into model selection for specific tasks in Devanagari script processing.
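For reference, the FastText side of the comparison reduces to a standard supervised text classifier over labeled tweets. The sketch below shows this setup for hate speech detection; the file name, label scheme, and hyperparameters are illustrative assumptions rather than the paper's reported configuration.

    # Sketch: supervised FastText classifier for hate speech detection
    # on Devanagari tweets. Assumes the official fasttext Python package.
    import fasttext

    # train.txt holds one tweet per line in FastText's supervised format,
    # e.g.:
    #   __label__hate <Devanagari tweet text>
    #   __label__non-hate <Devanagari tweet text>
    model = fasttext.train_supervised(
        input="train.txt",  # illustrative path
        lr=0.5,             # assumed hyperparameters, not from the paper
        epoch=25,
        wordNgrams=2,       # word bigram features
    )

    # Predict the label and its probability for a hypothetical tweet.
    labels, probs = model.predict("नमूना ट्वीट पाठ")
    print(labels[0], probs[0])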