Partitioned Gradient Matching-based Data Subset Selection for Compute-Efficient Robust ASR Training

Ashish Mittal, Durga Sivasubramanian, Rishabh Iyer, Preethi Jyothi, Ganesh Ramakrishnan


Abstract
Training state-of-the-art ASR systems such as RNN-T often has a high associated financial and environmental cost. Training with a subset of training data could mitigate this problem if the selected subset could achieve performance on par with training on the entire dataset. Although many data subset selection (DSS) algorithms exist, applying them directly to RNN-T is difficult, especially for DSS algorithms that are adaptive and use learning dynamics such as gradients, since RNN-T gradients have a significantly larger memory footprint. In this paper, we propose Partitioned Gradient Matching (PGM), a novel distributable DSS algorithm suitable for massive datasets like those used to train RNN-T. Through extensive experiments on Librispeech 100H and Librispeech 960H, we show that PGM achieves between 3x and 6x speedup with only a very small accuracy degradation (under 1% absolute WER difference). In addition, we demonstrate similar results for PGM even in settings where the training data is corrupted with noise.
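The gradient-matching idea underlying PGM can be sketched roughly as follows. This is an illustrative toy version only, not the paper's exact algorithm: it greedily picks examples whose summed gradients best approximate the full-data gradient, and the "partitioned" variant simply runs that selection independently per partition so the work can be distributed. All function names are hypothetical, and the per-example gradients are stood in for by random vectors.

```python
import numpy as np

def greedy_gradient_matching(grads, budget):
    """Greedily pick `budget` examples whose gradient sum best
    approximates the full-data gradient sum (toy sketch)."""
    target = grads.sum(axis=0)
    selected = []
    residual = target.copy()
    for _ in range(budget):
        # Score each example by alignment with the remaining residual.
        scores = grads @ residual
        scores[selected] = -np.inf  # never pick an example twice
        best = int(np.argmax(scores))
        selected.append(best)
        residual = residual - grads[best]
    return selected

def partitioned_gradient_matching(grads, budget, num_partitions):
    """Split the dataset into partitions and run gradient matching
    independently in each one; the partitions are independent, so in a
    real system they could be processed in parallel across workers."""
    n = len(grads)
    partitions = np.array_split(np.arange(n), num_partitions)
    per_partition_budget = budget // num_partitions
    subset = []
    for idx in partitions:
        local = greedy_gradient_matching(grads[idx], per_partition_budget)
        subset.extend(int(idx[i]) for i in local)
    return subset

# Toy usage with random stand-in "gradients".
rng = np.random.default_rng(0)
grads = rng.normal(size=(40, 8))   # 40 examples, 8-dim gradients
subset = partitioned_gradient_matching(grads, budget=8, num_partitions=4)
```

Partitioning matters here because the full gradients of an RNN-T model are far too large to hold for every training example at once; working per partition bounds memory and makes the selection embarrassingly parallel.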
Anthology ID:
2022.findings-emnlp.443
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2022
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates
Editors:
Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
5999–6010
URL:
https://aclanthology.org/2022.findings-emnlp.443
DOI:
10.18653/v1/2022.findings-emnlp.443
Cite (ACL):
Ashish Mittal, Durga Sivasubramanian, Rishabh Iyer, Preethi Jyothi, and Ganesh Ramakrishnan. 2022. Partitioned Gradient Matching-based Data Subset Selection for Compute-Efficient Robust ASR Training. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 5999–6010, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal):
Partitioned Gradient Matching-based Data Subset Selection for Compute-Efficient Robust ASR Training (Mittal et al., Findings 2022)
PDF:
https://preview.aclanthology.org/naacl-24-ws-corrections/2022.findings-emnlp.443.pdf
Software:
 2022.findings-emnlp.443.software.zip
Video:
 https://preview.aclanthology.org/naacl-24-ws-corrections/2022.findings-emnlp.443.mp4