Ayushi Pandey
2026
Same-Language Subtitles for Low-resource Languages: A Case of Bundelkhandi
Anirudh Pradhan | Ayushi Pandey | Divyansh Kushwaha | Akshita Tiwary | Vivek Seshadri
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Same-language subtitles enhance consumers’ experience of audiovisual content, including for the hearing-impaired population. However, while high-resource languages can benefit from automatic subtitling, subtitles are seldom available to content creators in regional languages. This limits audience engagement with their content, which is often independently produced. This paper presents Project Saurakhi, a platform for generating same-language subtitles in regional languages. First, community-generated YouTube videos serve as the primary data source for this project. The current dataset comprises 63 hours of Bundelkhandi speech sourced from 207 YouTube videos across 19 content creators. Second, the technical workflow integrates automated stages with manual refinement via a mobile annotation platform. As regional language content grows both in independent productions and on over-the-top platforms, Project Saurakhi aims to train women participants in rural India to become proficient in providing subtitles in their native languages.
Keywords: corpus creation, low-resource languages, Bundelkhandi, Indian languages, conversational AI, speech recognition, YouTube data
2025
ELR-1000: A Community-Generated Dataset for Endangered Indic Indigenous Languages
Neha Joshi | Pamir Gogoi | AasimBaig Mirza | Aayush Jansari | Aditya Yadavalli | Ayushi Pandey | Arunima Shukla | Deepthi Sudharsan | Kalika Bali | Vivek Seshadri
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
We present a culturally-grounded multimodal dataset of 1,060 traditional recipes crowdsourced from rural communities across remote regions of Eastern India, spanning 10 endangered languages. These recipes, rich in linguistic and cultural nuance, were collected using a mobile interface designed for contributors with low digital literacy. The dataset, Endangered Language Recipes (ELR)-1000, captures not only culinary practices but also the socio-cultural context embedded in indigenous food traditions. We evaluate the performance of several state-of-the-art large language models (LLMs) on translating these recipes into English and find that, despite the models’ capabilities, they struggle with low-resource, culturally-specific language. However, we observe that providing targeted context (background information about the languages, translation examples, and guidelines for cultural preservation) leads to significant improvements in translation quality. Our results underscore the need for benchmarks that cater to underrepresented languages and domains to advance equitable and culturally-aware language technologies. As part of this work, we release the ELR-1000 dataset to the NLP community, hoping it motivates the development of language technologies for endangered languages.
2018
Phonetically Balanced Code-Mixed Speech Corpus for Hindi-English Automatic Speech Recognition
Ayushi Pandey | Brij Mohan Lal Srivastava | Rohit Kumar | Bhanu Teja Nellore | Kasi Sai Teja | Suryakanth V. Gangashetty
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2017
Towards developing a phonetically balanced code-mixed speech corpus for Hindi-English ASR
Ayushi Pandey | Brij Mohan Lal Srivastava | Suryakanth Gangashetty
Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)