Pamir Gogoi
2025
ELR-1000: A Community-Generated Dataset for Endangered Indic Indigenous Languages
Neha Joshi | Pamir Gogoi | AasimBaig Mirza | Aayush Jansari | Aditya Yadavalli | Ayushi Pandey | Arunima Shukla | Deepthi Sudharsan | Kalika Bali | Vivek Seshadri
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
We present a culturally grounded multimodal dataset of 1,060 traditional recipes crowdsourced from rural communities across remote regions of Eastern India, spanning 10 endangered languages. These recipes, rich in linguistic and cultural nuance, were collected using a mobile interface designed for contributors with low digital literacy. The dataset, Endangered Language Recipes (ELR)-1000, captures not only culinary practices but also the socio-cultural context embedded in indigenous food traditions. We evaluate the performance of several state-of-the-art large language models (LLMs) on translating these recipes into English and find that, despite their capabilities, the models struggle with low-resource, culturally specific language. However, we observe that providing targeted context (background information about the languages, translation examples, and guidelines for cultural preservation) leads to significant improvements in translation quality. Our results underscore the need for benchmarks that cater to underrepresented languages and domains to advance equitable and culturally aware language technologies. As part of this work, we release the ELR-1000 dataset to the NLP community, hoping it motivates the development of language technologies for endangered languages.
2024
MunTTS: A Text-to-Speech System for Mundari
Varun Gumma | Rishav Hada | Aditya Yadavalli | Pamir Gogoi | Ishani Mondal | Vivek Seshadri | Kalika Bali
Proceedings of the Seventh Workshop on the Use of Computational Methods in the Study of Endangered Languages
We present MunTTS, an end-to-end text-to-speech (TTS) system specifically for Mundari, a low-resource Indian language of the Austro-Asiatic family. Our work addresses the gap in linguistic technology for underrepresented languages by collecting and processing data to build a speech synthesis system. We begin our study by gathering a substantial dataset of Mundari text and speech, and then train end-to-end speech models. We also describe the methods used for training our models, ensuring they are efficient and effective despite the data constraints. We evaluate our system with native speakers and objective metrics, demonstrating its potential as a tool for preserving and promoting the Mundari language in the digital age.
Co-authors
- Kalika Bali 2
- Vivek Seshadri 2
- Aditya Yadavalli 2
- Varun Gumma 1
- Rishav Hada 1