Anubha Gupta

2026

Task-Lens: Cross-Task Utility Based Speech Dataset Profiling for Low-Resource Indian Languages
Swati Sharma | Divya V. Sharma | Anubha Gupta
Proceedings of the Fifteenth Language Resources and Evaluation Conference

The rising demand for inclusive speech technologies amplifies the need for multilingual datasets for Natural Language Processing (NLP) research. However, limited awareness of existing task-specific resources in low-resource languages hinders research. This challenge is especially acute in linguistically diverse countries, such as India. Cross-task profiling of existing Indian speech datasets can alleviate the data scarcity challenge. This involves investigating the utility of datasets across multiple downstream tasks rather than focusing on a single task. Prior surveys typically catalogue datasets for a single task, leaving comprehensive cross-task profiling as an open opportunity. Therefore, we propose Task-Lens, a cross-task survey that assesses the readiness of 50 Indian speech datasets spanning 26 languages for nine downstream speech tasks. First, we analyze which datasets contain metadata and properties suitable for specific tasks. Next, we propose task-aligned enhancements to unlock datasets to their full downstream potential. Finally, we identify tasks and Indian languages that are critically underserved by current resources. Our findings reveal that many Indian speech datasets contain untapped metadata that can support multiple downstream tasks. By uncovering cross-task linkages and gaps, Task-Lens enables researchers to explore the broader applicability of existing datasets and to prioritize dataset creation for underserved tasks and languages.

2025

IndicSynth: A Large-Scale Multilingual Synthetic Speech Dataset for Low-Resource Indian Languages
Divya V Sharma | Vijval Ekbote | Anubha Gupta
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recent advances in synthetic speech generation technology have facilitated the generation of high-quality synthetic (fake) speech that emulates human voices. These technologies pose a threat of misuse for identity theft and the spread of misinformation. Consequently, the misuse of such powerful technologies necessitates the development of robust and generalizable audio deepfake detection (ADD) and anti-spoofing models. However, such models are often linguistically biased. Consequently, models trained on datasets in one language exhibit low accuracy when evaluated on out-of-domain languages. Such biases reduce the usability of these models and highlight the urgent need for multilingual synthetic speech datasets for bias mitigation research. However, most available datasets are in English or Chinese. The dearth of multilingual synthetic datasets hinders multilingual ADD and anti-spoofing research. Furthermore, the problem intensifies in countries with rich linguistic diversity, such as India. Therefore, we introduce IndicSynth, which contains 4,000 hours of synthetic speech from 989 target speakers (456 female and 533 male) across 12 low-resource Indian languages. The dataset includes rich metadata covering gender details and target speaker identifiers. Experimental results demonstrate that IndicSynth is a valuable contribution to multilingual ADD and anti-spoofing research. The dataset can be accessed from https://github.com/vdivyas/IndicSynth.

Venues

ACL (1)
LREC (1)