Rupak Raj Ghimire
Modern general-purpose speech recognition systems are more robust in high-resource languages. However, achieving state-of-the-art accuracy for low-resource languages is still challenging. To deal with this challenge, one popular practice is fine-tuning a pre-trained model in low-resource settings. Nevertheless, pre-trained or fine-tuned models fail to capture the complex character and word constituency of Devanagari-script transcription. We propose a complementary loss function designed to force the model to learn the character constituency of the Devanagari script. Our complementary loss function, called Rule-Based Character Constituency Loss (RBCCL), penalizes incorrect transcriptions and updates the overall loss during the model training phase. This loss function can be combined with CTC loss or cross-entropy loss, both of which are widely used in ASR training. Our experiments show that combining the existing cross-entropy loss with the new complementary loss (RBCCL) improves the Word Error Rate (WER), reducing it from 47.1% to 23.41%, which is a promising result.
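To make the loss combination concrete, the following minimal PyTorch sketch shows how a rule-based penalty term could be added to a standard CTC loss. The function rbccl_penalty and the weight lam are illustrative placeholders; the paper's actual RBCCL rules for Devanagari character constituency are not reproduced here.

import torch
import torch.nn.functional as F

def rbccl_penalty(hypotheses):
    # Placeholder: return one penalty per decoded hypothesis, counting
    # violations of Devanagari character-constituency rules (e.g. a
    # dependent vowel sign that is not preceded by a consonant).
    return torch.zeros(len(hypotheses))

def combined_loss(log_probs, targets, input_lengths, target_lengths,
                  hypotheses, lam=0.1):
    # Standard CTC loss over frame-level log-probabilities, plus the
    # rule-based penalty averaged over the batch (weight lam is assumed).
    ctc = F.ctc_loss(log_probs, targets, input_lengths, target_lengths)
    return ctc + lam * rbccl_penalty(hypotheses).mean()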
The performance of Automatic Speech Recognition (ASR) systems has improved significantly, driven by advancements in large-scale pre-trained models. However, adapting such models to low-resource languages such as Nepali is challenging due to the lack of labeled data and computational resources. Additionally, adapting a model to the unique speech characteristics of an individual speaker is also challenging; personalization helps target the model to a particular speaker. This work investigates parameter-efficient fine-tuning (PEFT) methods, namely Low-Rank Adaptation (LoRA) and Weight-Decomposed Low-Rank Adaptation (DoRA), to improve the performance of fine-tuned Whisper ASR models on Nepali ASR tasks through personalization. The experiments demonstrate that the PEFT methods obtain competitive results while significantly reducing the number of trainable parameters compared to full fine-tuning. LoRA and DoRA show a relative WER increment over FTBase of 34.93% and 36.79%, respectively, and a relative CER increment over FTBase of 49.50% and 50.03%, respectively. Furthermore, the results highlight a 99.74% reduction in total training parameters.
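As a rough illustration of how LoRA or DoRA adapters can be attached to a Whisper checkpoint with the Hugging Face peft library, the sketch below uses assumed hyperparameters (rank, alpha, target modules) and an assumed checkpoint name rather than the configuration reported in the paper.

from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

config = LoraConfig(
    r=8,                                  # low-rank dimension (assumed)
    lora_alpha=16,                        # scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    use_dora=False,                       # set True to use DoRA instead of LoRA
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # reports the fraction of trainable parameters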
Modern general-purpose speech recognition systems are more robust in high-resource languages. In contrast, achieving state-of-the-art accuracy for low-resource languages is still challenging. Fine-tuning a pre-trained model is a highly popular practice: it utilizes existing knowledge while efficiently learning from a small amount of data to enhance the precision and robustness of speech recognition. This work attempts to diagnose the performance of a pre-trained model when transcribing audio in a low-resource language. We apply an adapter-based iterative parameter-efficient fine-tuning strategy on a limited dataset, aiming to improve the transcription quality of a previously fine-tuned model. For the experiment we used Whisper's multilingual pre-trained speech model with Nepali as the test language. Using this approach we achieved a Word Error Rate of 27.9%, which is more than a 19% improvement over pre-trained Whisper Large-V2.
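A possible shape of such an iterative adapter-based loop is sketched below; shards and train_one_round are hypothetical placeholders, and the LoRA settings are assumptions rather than the configuration used in the work.

from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")

for shard in shards:  # `shards`: small labeled subsets of the limited dataset
    peft_model = get_peft_model(
        model,
        LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"]),
    )
    train_one_round(peft_model, shard)     # placeholder for one fine-tuning pass
    model = peft_model.merge_and_unload()  # fold the adapter back in before the next round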
Automatic Speech Recognition (ASR) has seen significant advancements over the course of several decades, transitioning from rule-based methods to statistical approaches, and ultimately to end-to-end (E2E) frameworks. This progression continues with advances in machine learning and deep learning methodologies. The E2E approach to ASR has demonstrated strong success for resource-rich languages with large annotated corpora. However, accuracy remains quite low for low-resource languages such as Nepali. In this regard, language-specific tools such as tokenizers play a vital role in improving the performance of E2E models for low-resource languages like Nepali. In this paper, we propose a pronunciation-aware syllable tokenizer for the Nepali language which improves the results of the E2E model. Our experiments confirm that the proposed tokenizer yields better performance, with a Character Error Rate (CER) of 8.09%, compared to other language-independent tokenizers.
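The paper's pronunciation-aware rules are not reproduced here, but the following simplified sketch shows what syllable-level tokenization of Devanagari text can look like, using a hand-written regular expression that is only a coarse approximation.

import re

# One syllable ≈ consonant (+ halant-joined consonants) + optional vowel sign
# and anusvara, or an independent vowel. This is an approximation, not the
# tokenizer proposed in the paper.
SYLLABLE = re.compile(
    r"[\u0915-\u0939\u0958-\u095F]"             # base consonant
    r"(?:\u094D[\u0915-\u0939\u0958-\u095F])*"  # conjuncts joined by halant
    r"[\u093E-\u094C]?\u0902?"                  # optional vowel sign, anusvara
    r"|[\u0904-\u0914]"                         # independent vowel
)

def syllabify(text):
    return SYLLABLE.findall(text)

print(syllabify("नेपाली"))  # ['ने', 'पा', 'ली']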
Fine-tuning a pre-trained language model is a technique that can be used to enhance technologies for low-resource languages. An unsupervised approach can fine-tune any pre-trained model with minimal or even no language-specific resources, which is highly advantageous, particularly for languages with limited computational resources. We present a novel approach for fine-tuning a pre-trained Automatic Speech Recognition (ASR) model that is suitable for low-resource languages. Our method involves iterative fine-tuning of a pre-trained ASR model, with mms-1b selected as the pre-trained seed model. We take the Nepali language as a case study for this research work. Our approach achieved a CER of 6.77%, outperforming all previously recorded CER values for Nepali ASR systems.
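One common way to realize unsupervised iterative fine-tuning is a pseudo-labelling loop; the sketch below assumes that pattern and is not the paper's exact procedure. pseudo_label, filter_confident, fine_tune, num_iterations, and unlabeled_audio are hypothetical placeholders, and the mms-1b checkpoint name is an assumption.

from transformers import Wav2Vec2ForCTC, AutoProcessor

model_id = "facebook/mms-1b-all"            # assumed mms-1b seed checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

for _ in range(num_iterations):                              # assumed iteration count
    hyps = pseudo_label(model, processor, unlabeled_audio)   # transcribe unlabeled speech
    train_set = filter_confident(hyps)                       # keep reliable transcripts only
    model = fine_tune(model, processor, train_set)           # next fine-tuning round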