2025
Towards Unified Processing of Perso-Arabic Scripts for ASR
Srihari Bandarupalli | Bhavana Akkiraju | Sri Charan Devarakonda | Harinie Sivaramasethu | Vamshiraghusimha Narasinga | Anil Vuppala
Proceedings of the 1st Workshop on NLP for Languages Using Arabic Script
Automatic Speech Recognition (ASR) systems for morphologically complex languages like Urdu, Persian, and Arabic face unique challenges due to the intricacies of Perso-Arabic scripts. Conventional data processing methods often fall short in handling these languages’ phonetic and morphological nuances. This paper introduces a unified data processing pipeline tailored specifically to Perso-Arabic languages, addressing the complexities inherent in these scripts. The proposed pipeline encompasses comprehensive steps for data cleaning, tokenization, and phonemization, each of which has been evaluated and validated by expert linguists. Through these expert-driven refinements, our pipeline provides a robust foundation for advancing ASR performance across Perso-Arabic languages, supporting the development of more accurate and linguistically informed multilingual ASR systems in the future.
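To make the cleaning step concrete, here is a minimal sketch of Perso-Arabic text normalization, assuming Unicode normalization, character unification, and diacritic stripping; the mappings shown are illustrative, not the paper's exact rules:

```python
# Illustrative Perso-Arabic cleaning sketch (not the paper's exact pipeline).
import re
import unicodedata

# Hypothetical mapping: unify visually identical letter variants across
# Arabic/Persian/Urdu onto one canonical code point.
CHAR_MAP = {
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> KEHEH
}
# Optional short-vowel marks: tanween, harakat, dagger alif.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")

def clean(text: str) -> str:
    text = unicodedata.normalize("NFC", text)
    text = "".join(CHAR_MAP.get(ch, ch) for ch in text)
    text = DIACRITICS.sub("", text)           # drop optional diacritics
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

print(clean("عليٌ"))  # diacritic removed, letter variants unified
```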
TeluguST-46: A Benchmark Corpus and Comprehensive Evaluation for Telugu-English Speech Translation
Bhavana Akkiraju | Srihari Bandarupalli | Swathi Sambangi | R Vijaya Saraswathi | Dr Vasavi Ravuri | Anil Vuppala
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Despite Telugu being spoken by over 80 million people, speech translation research for this morphologically rich language remains severely underexplored. We address this gap by developing a high-quality Telugu-English speech translation benchmark from 46 hours of manually verified CSTD corpus data (30h/8h/8h train/dev/test split). Our systematic comparison of cascaded versus end-to-end architectures shows that while IndicWhisper + IndicMT achieves the highest performance due to extensive Telugu-specific training data, fine-tuned SeamlessM4T models remain remarkably competitive despite using significantly less Telugu-specific training data. This finding suggests that with careful hyperparameter tuning and sufficient parallel data (potentially less than 100 hours), end-to-end systems can achieve performance comparable to cascaded approaches in low-resource settings. Meanwhile, our metric reliability study, evaluating BLEU, METEOR, ChrF++, ROUGE-L, TER, and BERTScore against human judgments, reveals that traditional metrics provide better quality discrimination than BERTScore for Telugu–English translation. The work delivers three key contributions: a reproducible Telugu–English benchmark, empirical evidence of competitive end-to-end performance potential in low-resource scenarios, and practical guidance for automatic evaluation in morphologically complex language pairs.
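As an illustration of the metric reliability analysis, here is a minimal sketch of segment-level correlation between automatic metrics and human judgments, using sacrebleu and scipy; the sentences and human ratings are placeholders, not data from the paper:

```python
# Sketch: which metric ranks segments the way human annotators do?
import sacrebleu
from scipy.stats import spearmanr

# Placeholder segments and hypothetical human adequacy ratings (1-5 scale).
hyps = ["the farmer went to the market",
        "rains were heavy this year",
        "he sang a song"]
refs = ["the farmer went to the market",
        "rainfall was heavy this year",
        "she sang a folk song"]
human = [5.0, 3.5, 2.5]

# Segment-level scores; chrF++ is chrF with word 2-grams (word_order=2).
bleu   = [sacrebleu.sentence_bleu(h, [r]).score for h, r in zip(hyps, refs)]
chrfpp = [sacrebleu.sentence_chrf(h, [r], word_order=2).score
          for h, r in zip(hyps, refs)]

# Rank correlation with human judgments indicates discrimination quality.
print("BLEU   vs human:", spearmanr(bleu, human).correlation)
print("chrF++ vs human:", spearmanr(chrfpp, human).correlation)
```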
Efficient ASR for Low-Resource Languages: Leveraging Cross-Lingual Unlabeled Data
Srihari Bandarupalli | Bhavana Akkiraju | Sri Charan D | Vamshi Raghu Simha Narasinga | Anil Vuppala
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Automatic speech recognition for low-resource languages remains fundamentally constrained by the scarcity of labeled data and the computational resources required by state-of-the-art models. We present a systematic investigation into cross-lingual continual pretraining for low-resource languages, using Perso-Arabic languages (Persian, Arabic, and Urdu) as our primary case study. Our approach demonstrates that strategic utilization of unlabeled speech data can effectively bridge the resource gap without sacrificing recognition accuracy. We construct a 3,000-hour multilingual corpus through a scalable unlabeled data collection pipeline and employ targeted continual pretraining combined with morphologically aware tokenization to develop a 300M-parameter model that achieves performance comparable to systems five times larger. Our model outperforms Whisper Large v3 (1.5B parameters) on Persian and achieves competitive results on Arabic and Urdu despite using significantly fewer parameters and substantially less labeled data. These findings challenge the prevailing assumption that ASR quality scales primarily with model size, revealing instead that data relevance and strategic pretraining are more critical factors in low-resource scenarios. This work provides a practical pathway toward inclusive speech technology, enabling effective ASR for underrepresented languages without dependence on massive computational infrastructure or proprietary datasets.
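As a sketch of what morphologically aware subword tokenization can look like in practice (the paper's exact tokenizer may differ; the file names and vocabulary size here are assumptions), a SentencePiece unigram model:

```python
# Illustrative subword tokenization for Perso-Arabic transcripts.
import sentencepiece as spm

# Train a unigram model; unigram segmentation often aligns better with
# morpheme boundaries than BPE for morphologically rich languages.
spm.SentencePieceTrainer.train(
    input="transcripts_fa_ar_ur.txt",    # hypothetical corpus file
    model_prefix="perso_arabic_unigram",
    vocab_size=8000,                     # assumed size, not the paper's
    model_type="unigram",
    character_coverage=1.0,              # keep all Perso-Arabic code points
)

sp = spm.SentencePieceProcessor(model_file="perso_arabic_unigram.model")
# Segmentation depends on the training corpus; shown here for illustration.
print(sp.encode("کتابخانه", out_type=str))
```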
StuD: A Multimodal Approach for Stuttering Detection with RAG and Fusion Strategies
Pragya Khanna | Priyanka Kommagouni | Vamshi Raghu Simha Narasinga | Anil Vuppala
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Stuttering is a complex speech disorder that challenges both ASR systems and clinical assessment. We propose a multimodal stuttering detection and classification model that integrates acoustic and linguistic features through a two-stage fusion mechanism. Fine-tuned Wav2Vec 2.0 and HuBERT models extract acoustic embeddings, which are early-fused with MFCC features to capture fine-grained spectral and phonetic variations, while Llama-2 embeddings from Whisper ASR transcriptions provide linguistic context. To enhance robustness against out-of-distribution speech patterns, we incorporate Retrieval-Augmented Generation (RAG) for adaptive classification. Our model achieves state-of-the-art performance on SEP-28k and FluencyBank, demonstrating significant improvements in detecting challenging stuttering events. Additionally, our analysis highlights the complementary nature of the acoustic and linguistic modalities, reinforcing the need for multimodal approaches in speech disorder detection.
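A toy sketch of the two-stage fusion idea follows; the embedding dimensions and layer sizes are assumptions for illustration, not the paper's configuration:

```python
# Toy two-stage fusion: early fusion of acoustic embeddings with MFCCs,
# then late fusion with a linguistic (LLM-derived) embedding.
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, acoustic_dim=768, mfcc_dim=40, text_dim=4096, n_classes=5):
        super().__init__()
        # Stage 1: early fusion of self-supervised acoustic features + MFCCs.
        self.acoustic_fuse = nn.Linear(acoustic_dim + mfcc_dim, 512)
        # Stage 2: fuse with the linguistic embedding, then classify.
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(512 + text_dim, n_classes))

    def forward(self, acoustic, mfcc, text):
        early = self.acoustic_fuse(torch.cat([acoustic, mfcc], dim=-1))
        return self.head(torch.cat([early, text], dim=-1))

model = FusionClassifier()
logits = model(torch.randn(2, 768), torch.randn(2, 40), torch.randn(2, 4096))
print(logits.shape)  # torch.Size([2, 5])
```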
IIITH-BUT system for IWSLT 2025 low-resource Bhojpuri to Hindi speech translation
Bhavana Akkiraju | Aishwarya Pothula | Santosh Kesiraju | Anil Vuppala
Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025)
This paper presents the submission of IIITH-BUT to the IWSLT 2025 shared task on speech translation for the low-resource Bhojpuri-Hindi language pair. We explored the impact of hyperparameter optimisation and data augmentation techniques on the performance of the SeamlessM4T model fine-tuned for this specific task. We systematically investigated a range of hyperparameters, including learning rate schedules, number of update steps, warm-up steps, label smoothing, and batch sizes, and report their effect on translation quality. To address data scarcity, we applied speed perturbation and SpecAugment and measured their impact. We also examined the use of cross-lingual signal through joint training with Marathi and Bhojpuri speech data. Our experiments reveal that careful selection of hyperparameters and the application of simple yet effective augmentation techniques significantly improve performance in low-resource settings. We also analysed the translation hypotheses to understand the kinds of errors that impacted translation quality in terms of BLEU.
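Both augmentations named above are available in torchaudio; a minimal sketch follows, with illustrative parameter values rather than the paper's settings:

```python
# Sketch of speed perturbation and SpecAugment with torchaudio.
import torch
import torchaudio

waveform, sr = torch.randn(1, 16000), 16000  # stand-in for a real utterance

# Speed perturbation: change playback speed, then restore the sample rate.
perturbed, _ = torchaudio.sox_effects.apply_effects_tensor(
    waveform, sr, [["speed", "0.9"], ["rate", str(sr)]]
)

# SpecAugment: mask random frequency bands and time steps on the spectrogram.
spec = torchaudio.transforms.MelSpectrogram(sample_rate=sr)(perturbed)
spec = torchaudio.transforms.FrequencyMasking(freq_mask_param=15)(spec)
spec = torchaudio.transforms.TimeMasking(time_mask_param=35)(spec)
print(spec.shape)
```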
2022
Exploring the Effect of Dialect Mismatched Language Models in Telugu Automatic Speech Recognition
Aditya Yadavalli | Ganesh Sai Mirishkar | Anil Vuppala
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop
Previous research has found that the Acoustic Model (AM) of an Automatic Speech Recognition (ASR) system is susceptible to dialect variations within a language, thereby adversely affecting ASR performance. To counter this, researchers have proposed building a dialect-specific AM while keeping the Language Model (LM) constant across dialects. This study explores the effect of a dialect-mismatched LM by considering three Telugu regional dialects: Telangana, Coastal Andhra, and Rayalaseema. We show that dialect variations, which surface in the form of a different lexicon, grammar, and occasionally semantics, can significantly degrade the performance of the LM under mismatched conditions, and that this degradation in turn hurts the ASR even when a dialect-specific AM is used. We observe a degradation of up to 13.13 perplexity points when the LM is used under mismatched conditions, and degradations of over 9% in Character Error Rate (CER) and over 15% in Word Error Rate (WER) in ASR systems using mismatched LMs rather than matched LMs.
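The matched-versus-mismatched perplexity comparison can be sketched with kenlm n-gram models; the file names below are hypothetical:

```python
# Score each dialect's held-out text under every dialect's n-gram LM.
import kenlm

dialects = ["telangana", "coastal_andhra", "rayalaseema"]
models = {d: kenlm.Model(f"{d}.arpa") for d in dialects}  # hypothetical LMs

def corpus_perplexity(model, sentences):
    # kenlm returns total log10 probability; ppl = 10^(-avg log10 prob).
    log10_total, n_tokens = 0.0, 0
    for s in sentences:
        log10_total += model.score(s)      # includes </s> by default
        n_tokens += len(s.split()) + 1     # words + end-of-sentence token
    return 10 ** (-log10_total / n_tokens)

for lm in dialects:
    for test in dialects:
        sents = open(f"{test}_test.txt").read().splitlines()  # hypothetical
        tag = "matched" if lm == test else "mismatched"
        print(f"{lm} LM on {test} test ({tag}):",
              corpus_perplexity(models[lm], sents))
```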