The AMARA corpus: building resources for translating the web’s educational content

Francisco Guzman, Hassan Sajjad, Stephan Vogel, Ahmed Abdelali


Abstract
In this paper, we introduce a new parallel corpus of subtitles of educational videos: the AMARA corpus for online educational content. We crawl a multilingual collection of community-generated subtitles and present the results of processing the Arabic–English portion of the data, which yields a parallel corpus of about 2.6M Arabic and 3.9M English words. We explore different approaches to aligning the segments and extrinsically evaluate the resulting parallel corpus on the standard TED-talks tst-2010 test set. We observe that the data can be successfully used for this task, and that it yields an absolute improvement of 1.6 BLEU when used in combination with TED data. Finally, we analyze some of the specific challenges of translating educational content.
Anthology ID:
2013.iwslt-papers.2
Volume:
Proceedings of the 10th International Workshop on Spoken Language Translation: Papers
Month:
December 5-6
Year:
2013
Address:
Heidelberg, Germany
Editor:
Joy Ying Zhang
Venue:
IWSLT
SIG:
SIGSLT
URL:
https://aclanthology.org/2013.iwslt-papers.2
Cite (ACL):
Francisco Guzman, Hassan Sajjad, Stephan Vogel, and Ahmed Abdelali. 2013. The AMARA corpus: building resources for translating the web’s educational content. In Proceedings of the 10th International Workshop on Spoken Language Translation: Papers, Heidelberg, Germany.
Cite (Informal):
The AMARA corpus: building resources for translating the web’s educational content (Guzman et al., IWSLT 2013)
PDF:
https://preview.aclanthology.org/emnlp-22-attachments/2013.iwslt-papers.2.pdf