Giuseppe Della Corte

2020

IESTAC: English-Italian Parallel Corpus for End-to-End Speech-to-Text Machine Translation
Giuseppe Della Corte | Sara Stymne
Proceedings of the First International Workshop on Natural Language Processing Beyond Text

We discuss a set of methods for the creation of IESTAC: an English-Italian speech and text parallel corpus designed for the training of end-to-end speech-to-text machine translation models and publicly released as part of this work. We first mapped English LibriVox audiobooks and their corresponding English Project Gutenberg e-books to Italian e-books with a set of three complementary methods. Then we aligned the English and the Italian texts using both traditional Gale-Church based alignment methods and a recently proposed tool that performs bilingual sentence alignment by computing the cosine similarity of multilingual sentence embeddings. Finally, we force-aligned the English audiobooks with the English side of our textual parallel corpus using a forced alignment tool based on text-to-speech and dynamic time warping. For each step, we provide the reader with a critical discussion based on detailed evaluation and comparison of the results of the different methods.

Co-authors

Sara Stymne 1

Venues

nlpbt1
