On the Impact of Noises in Crowd-Sourced Data for Speech Translation

Siqi Ouyang, Rong Ye, Lei Li


Abstract
Training speech translation (ST) models requires large, high-quality datasets. MuST-C is one of the most widely used ST benchmark datasets. It contains around 400 hours of speech-transcript-translation data for each of its eight translation directions. The dataset passed several quality-control filters during its creation. However, we find that MuST-C still suffers from three major quality issues: audio-text misalignment, inaccurate translations, and unnecessary speaker names. What impact do these data quality issues have on model development and evaluation? In this paper, we propose an automatic method to fix or filter the above quality issues, using English-German (En-De) translation as an example. Our experiments show that ST models perform better on clean test sets, and that the ranking of the proposed models remains consistent across different test sets. In addition, simply removing misaligned data points from the training set does not lead to a better ST model.
Anthology ID:
2022.iwslt-1.9
Original:
2022.iwslt-1.9v1
Version 2:
2022.iwslt-1.9v2
Volume:
Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022)
Month:
May
Year:
2022
Address:
Dublin, Ireland (in-person and online)
Venue:
IWSLT
SIG:
SIGSLT
Publisher:
Association for Computational Linguistics
Pages:
92–97
URL:
https://aclanthology.org/2022.iwslt-1.9
DOI:
10.18653/v1/2022.iwslt-1.9
Cite (ACL):
Siqi Ouyang, Rong Ye, and Lei Li. 2022. On the Impact of Noises in Crowd-Sourced Data for Speech Translation. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 92–97, Dublin, Ireland (in-person and online). Association for Computational Linguistics.
Cite (Informal):
On the Impact of Noises in Crowd-Sourced Data for Speech Translation (Ouyang et al., IWSLT 2022)
PDF:
https://preview.aclanthology.org/remove-xml-comments/2022.iwslt-1.9.pdf
Code:
owaski/must-c-clean
Data:
MuST-C