Partitioned Gradient Matching-based Data Subset Selection for Compute-Efficient Robust ASR Training

Ashish Mittal; Durga Sivasubramanian; Rishabh Iyer; Preethi Jyothi; Ganesh Ramakrishnan

doi:10.18653/v1/2022.findings-emnlp.443

Abstract

Training state-of-the-art ASR systems such as RNN-T often has a high associated financial and environmental cost. Training with a subset of the training data could mitigate this problem if the selected subset achieves performance on par with training on the entire dataset. Although there are many data subset selection (DSS) algorithms, applying them directly to RNN-T is difficult, especially for DSS algorithms that are adaptive and use learning dynamics such as gradients, because RNN-T gradients tend to have a significantly larger memory footprint. In this paper, we propose Partitioned Gradient Matching (PGM), a novel distributable DSS algorithm suitable for massive datasets like those used to train RNN-T. Through extensive experiments on Librispeech 100H and Librispeech 960H, we show that PGM achieves a 3x to 6x speedup with only a very small accuracy degradation (under 1% absolute WER difference). In addition, we demonstrate similar results for PGM even in settings where the training data is corrupted with noise.

Anthology ID: 2022.findings-emnlp.443
Volume: Findings of the Association for Computational Linguistics: EMNLP 2022
Month: December
Year: 2022
Address: Abu Dhabi, United Arab Emirates
Editors: Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 5999–6010
URL: https://aclanthology.org/2022.findings-emnlp.443
DOI: 10.18653/v1/2022.findings-emnlp.443
Cite (ACL): Ashish Mittal, Durga Sivasubramanian, Rishabh Iyer, Preethi Jyothi, and Ganesh Ramakrishnan. 2022. Partitioned Gradient Matching-based Data Subset Selection for Compute-Efficient Robust ASR Training. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 5999–6010, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal): Partitioned Gradient Matching-based Data Subset Selection for Compute-Efficient Robust ASR Training (Mittal et al., Findings 2022)
PDF: https://preview.aclanthology.org/naacl-24-ws-corrections/2022.findings-emnlp.443.pdf
Software: 2022.findings-emnlp.443.software.zip
Video: https://preview.aclanthology.org/naacl-24-ws-corrections/2022.findings-emnlp.443.mp4
