Partitioned Gradient Matching-based Data Subset Selection for Compute-Efficient Robust ASR Training
Ashish Mittal, Durga Sivasubramanian, Rishabh Iyer, Preethi Jyothi, Ganesh Ramakrishnan
Abstract
Training state-of-the-art ASR systems such as RNN-T often has a high associated financial and environmental cost. Training with a subset of the training data could mitigate this problem if the selected subset could achieve performance on par with training on the entire dataset. Although there are many data subset selection (DSS) algorithms, applying them directly to RNN-T is difficult, especially for DSS algorithms that are adaptive and use learning dynamics such as gradients, because RNN-T models tend to have gradients with a significantly larger memory footprint. In this paper, we propose Partitioned Gradient Matching (PGM), a novel distributable DSS algorithm suitable for massive datasets like those used to train RNN-T. Through extensive experiments on Librispeech 100H and Librispeech 960H, we show that PGM achieves a 3x to 6x speedup with only a very small accuracy degradation (under 1% absolute WER difference). In addition, we demonstrate similar results for PGM even in settings where the training data is corrupted with noise.
- Anthology ID:
- 2022.findings-emnlp.443
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2022
- Month:
- December
- Year:
- 2022
- Address:
- Abu Dhabi, United Arab Emirates
- Editors:
- Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 5999–6010
- URL:
- https://aclanthology.org/2022.findings-emnlp.443
- DOI:
- 10.18653/v1/2022.findings-emnlp.443
- Cite (ACL):
- Ashish Mittal, Durga Sivasubramanian, Rishabh Iyer, Preethi Jyothi, and Ganesh Ramakrishnan. 2022. Partitioned Gradient Matching-based Data Subset Selection for Compute-Efficient Robust ASR Training. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 5999–6010, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Cite (Informal):
- Partitioned Gradient Matching-based Data Subset Selection for Compute-Efficient Robust ASR Training (Mittal et al., Findings 2022)
- PDF:
- https://preview.aclanthology.org/naacl-24-ws-corrections/2022.findings-emnlp.443.pdf