Bhavana Nali
2026
Cascaded Modular or End-to-End? : An Investigation on Speech-to-Speech Translation Task for Dravidian Languages
Bhavana Nali | Abhik Jana
Proceedings of the Sixth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages
This paper presents a study of speech-to-speech translation for low-resource Dravidian languages, focusing on Tamil, Telugu, and Kannada. We compare a Cascaded Modular system with an End-to-end system in both zero-shot and fine-tuned settings. The Cascaded Modular approach combines an ASR module (Whisper-based ASR for English speech; IndicConformer for Dravidian speech), a text-to-text translation module (IndicTrans2), and a speech synthesis module (Indic Parler-TTS), whereas SeamlessM4T serves as the End-to-end system. We apply parameter-efficient Low-Rank Adaptation (LoRA) fine-tuning to adapt the translation component to domain-specific data, using FLEURS and Mann-ki-Baat (a subset of the BhasaAnuvaad dataset). Across languages, Cascaded Modular systems achieve BLEU scores ranging from 3.17 to 19.18 in the zero-shot setting and 5.08 to 19.18 after fine-tuning, whereas the End-to-end model ranges from 3.02 to 15.72 in the zero-shot setting and 4.11 to 16.84 after fine-tuning. The results show that Cascaded Modular systems consistently outperform the End-to-end model in both setups. Parameter-efficient fine-tuning also yields significant improvements in translation quality and speech generation performance for low-resource Dravidian speech translation.
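The cascaded design described in the abstract chains three stages: speech recognition, text translation, and speech synthesis. The sketch below illustrates only that composition. The class name `CascadedS2ST` and the toy stage functions are hypothetical stand-ins introduced for illustration; they are not the APIs of Whisper, IndicConformer, IndicTrans2, or Indic Parler-TTS, and the toy lexicon is not real translation data.

```python
from dataclasses import dataclass
from typing import Callable

# Assumed stage signatures; a real system would wrap model inference here.
ASR = Callable[[bytes], str]   # source speech -> source text
MT = Callable[[str], str]      # source text -> target text
TTS = Callable[[str], bytes]   # target text -> target speech


@dataclass
class CascadedS2ST:
    """Hypothetical container that runs ASR, then MT, then TTS in sequence."""
    asr: ASR
    mt: MT
    tts: TTS

    def translate(self, source_audio: bytes) -> bytes:
        transcript = self.asr(source_audio)   # stage 1: recognize speech
        translation = self.mt(transcript)     # stage 2: translate text
        return self.tts(translation)          # stage 3: synthesize speech


# Toy stand-ins so the sketch runs without any models (assumptions only).
def toy_asr(audio: bytes) -> str:
    # Pretend the audio bytes are already the spoken words.
    return audio.decode("utf-8")


def toy_mt(text: str) -> str:
    # Word-by-word lookup in a tiny illustrative lexicon.
    lexicon = {"hello": "vanakkam"}
    return " ".join(lexicon.get(word, word) for word in text.split())


def toy_tts(text: str) -> bytes:
    # Pretend the encoded text is the synthesized waveform.
    return text.encode("utf-8")


pipeline = CascadedS2ST(asr=toy_asr, mt=toy_mt, tts=toy_tts)
output_audio = pipeline.translate(b"hello world")
```

Because each stage is an independent callable, one component (for example the translation module) could in principle be swapped or fine-tuned without touching the others, which is the modularity that the cascaded approach relies on. An end-to-end system would instead expose a single speech-to-speech call.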