MuST-C: a Multilingual Speech Translation Corpus

Mattia A. Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, Marco Turchi


Abstract
Current research on spoken language translation (SLT) has to confront with the scarcity of sizeable and publicly available training corpora. This problem hinders the adoption of neural end-to-end approaches, which represent the state of the art in the two parent tasks of SLT: automatic speech recognition and machine translation. To fill this gap, we created MuST-C, a multilingual speech translation corpus whose size and quality will facilitate the training of end-to-end systems for SLT from English into 8 languages. For each target language, MuST-C comprises at least 385 hours of audio recordings from English TED Talks, which are automatically aligned at the sentence level with their manual transcriptions and translations. Together with a description of the corpus creation methodology (scalable to add new data and cover new languages), we provide an empirical verification of its quality and SLT results computed with a state-of-the-art approach on each language direction.
Anthology ID:
N19-1202
Volume:
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
Month:
June
Year:
2019
Address:
Minneapolis, Minnesota
Editors:
Jill Burstein, Christy Doran, Thamar Solorio
Venue:
NAACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
2012–2017
Language:
URL:
https://aclanthology.org/N19-1202
DOI:
10.18653/v1/N19-1202
Bibkey:
Cite (ACL):
Mattia A. Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2019. MuST-C: a Multilingual Speech Translation Corpus. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2012–2017, Minneapolis, Minnesota. Association for Computational Linguistics.
Cite (Informal):
MuST-C: a Multilingual Speech Translation Corpus (Di Gangi et al., NAACL 2019)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingest-bitext-workshop/N19-1202.pdf
Supplementary:
 N19-1202.Supplementary.pdf
Video:
 https://preview.aclanthology.org/ingest-bitext-workshop/N19-1202.mp4
Data
MuST-C