SUMAT: Data Collection and Parallel Corpus Compilation for Machine Translation of Subtitles

Volha Petukhova, Rodrigo Agerri, Mark Fishel, Sergio Penkale, Arantza del Pozo, Mirjam Sepesy Maučec, Andy Way, Panayota Georgakopoulou, Martin Volk


Abstract
Subtitling and audiovisual translation have been recognized as areas that could greatly benefit from the introduction of Statistical Machine Translation (SMT) followed by post-editing, in order to increase the efficiency of the subtitle production process. The FP7 European project SUMAT (An Online Service for SUbtitling by MAchine Translation: http://www.sumat-project.eu) aims to develop an online subtitle translation service for nine European languages, combined into 14 different language pairs, in order to semi-automate the subtitle translation processes of both freelance translators and subtitling companies on a large scale. In this paper we discuss the data collection and parallel corpus compilation for training SMT systems, which involves several procedures: data partitioning, conversion, formatting, normalization and alignment. We discuss each data pre-processing step in detail, together with the various approaches applied to it. Apart from its size (around 1 million subtitles per language pair), the SUMAT corpus has a number of important characteristics. First, it is of high quality, both in terms of translation and in terms of high-precision alignment of parallel documents and their contents. Second, the contents are provided in one consistent format and encoding. Finally, additional information, such as the type of content in terms of genre and domain, is available.
Anthology ID:
L12-1027
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
21–28
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/154_Paper.pdf
Cite (ACL):
Volha Petukhova, Rodrigo Agerri, Mark Fishel, Sergio Penkale, Arantza del Pozo, Mirjam Sepesy Maučec, Andy Way, Panayota Georgakopoulou, and Martin Volk. 2012. SUMAT: Data Collection and Parallel Corpus Compilation for Machine Translation of Subtitles. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 21–28, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
SUMAT: Data Collection and Parallel Corpus Compilation for Machine Translation of Subtitles (Petukhova et al., LREC 2012)