Dual Subtitles as Parallel Corpora

Shikun Zhang, Wang Ling, Chris Dyer


Abstract
In this paper, we leverage the existence of dual subtitles as a source of parallel data. Dual subtitles present viewers with two languages simultaneously and are generally aligned at the segment level, which removes the need to perform this alignment automatically. This is desirable because the extracted parallel data does not contain the alignment errors present in previous work that aligns different subtitle files for the same movie. We present a simple heuristic to detect and extract dual subtitles and show that more than 20 million sentence pairs can be extracted for the Mandarin-English language pair. We also show that extracting data from this source can be a viable solution for improving Machine Translation systems in the domain of subtitles.
Anthology ID:
L14-1137
Volume:
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month:
May
Year:
2014
Address:
Reykjavik, Iceland
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
1869–1874
URL:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/1199_Paper.pdf
Cite (ACL):
Shikun Zhang, Wang Ling, and Chris Dyer. 2014. Dual Subtitles as Parallel Corpora. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 1869–1874, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal):
Dual Subtitles as Parallel Corpora (Zhang et al., LREC 2014)