Same-Language Subtitles for Low-resource Languages: A Case of Bundelkhandi
Anirudh Pradhan, Ayushi Pandey, Divyansh Kushwaha, Akshita Tiwary, Vivek Seshadri
Abstract
Same-language subtitles enhance consumers’ experience of audiovisual content for both hearing and hearing-impaired populations. However, while high-resource languages benefit from automatic subtitling, subtitles are seldom available to content creators working in regional languages. This limits audience engagement with their content, which is often independently produced. This paper presents Project Saurakhi, a platform for generating same-language subtitles in regional languages. To achieve this, we first use community-generated YouTube videos as the primary data source for this project. The current dataset comprises 63 hours of Bundelkhandi speech sourced from 207 YouTube videos across 19 content creators. Second, the technical workflow integrates automated stages with manual refinement via a mobile annotation platform. As regional-language content grows both in independent productions and on over-the-top platforms, Project Saurakhi aims to train women participants in rural India to become proficient in providing subtitles in their native languages.
Keywords: corpus creation, low-resource languages, Bundelkhandi, Indian languages, conversational AI, speech recognition, YouTube data
- Anthology ID: 2026.lrec-main.246
- Volume: Proceedings of the Fifteenth Language Resources and Evaluation Conference
- Month: May
- Year: 2026
- Address: Palma de Mallorca, Spain
- Editors: Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
- Venue: LREC
- SIG:
- Publisher: ELRA Language Resource Association
- Note:
- Pages: 3147–3153
- Language:
- URL: https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.246/
- DOI:
- Cite (ACL): Anirudh Pradhan, Ayushi Pandey, Divyansh Kushwaha, Akshita Tiwary, and Vivek Seshadri. 2026. Same-Language Subtitles for Low-resource Languages: A Case of Bundelkhandi. International Conference on Language Resources and Evaluation, main:3147–3153.
- Cite (Informal): Same-Language Subtitles for Low-resource Languages: A Case of Bundelkhandi (Pradhan et al., LREC 2026)
- PDF: https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.246.pdf