MuST-C: a Multilingual Speech Translation Corpus

Mattia A. Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, Marco Turchi

[How to correct problems with metadata yourself]


Abstract
Current research on spoken language translation (SLT) has to confront with the scarcity of sizeable and publicly available training corpora. This problem hinders the adoption of neural end-to-end approaches, which represent the state of the art in the two parent tasks of SLT: automatic speech recognition and machine translation. To fill this gap, we created MuST-C, a multilingual speech translation corpus whose size and quality will facilitate the training of end-to-end systems for SLT from English into 8 languages. For each target language, MuST-C comprises at least 385 hours of audio recordings from English TED Talks, which are automatically aligned at the sentence level with their manual transcriptions and translations. Together with a description of the corpus creation methodology (scalable to add new data and cover new languages), we provide an empirical verification of its quality and SLT results computed with a state-of-the-art approach on each language direction.
Anthology ID:
N19-1202
Volume:
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
Month:
June
Year:
2019
Address:
Minneapolis, Minnesota
Editors:
Jill Burstein, Christy Doran, Thamar Solorio
Venue:
NAACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
2012–2017
Language:
URL:
https://aclanthology.org/N19-1202
DOI:
10.18653/v1/N19-1202
Bibkey:
Cite (ACL):
Mattia A. Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2019. MuST-C: a Multilingual Speech Translation Corpus. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2012–2017, Minneapolis, Minnesota. Association for Computational Linguistics.
Cite (Informal):
MuST-C: a Multilingual Speech Translation Corpus (Di Gangi et al., NAACL 2019)
Copy Citation:
PDF:
https://preview.aclanthology.org/teach-a-man-to-fish/N19-1202.pdf
Supplementary:
 N19-1202.Supplementary.pdf
Video:
 https://preview.aclanthology.org/teach-a-man-to-fish/N19-1202.mp4
Data
MuST-C