Enlarging Scarce In-domain English-Croatian Corpus for SMT of MOOCs Using Serbian
Maja Popović, Kostadin Cholakov, Valia Kordoni, Nikola Ljubešić
Abstract
Massive Open Online Courses have been growing rapidly in size and impact. Yet the language barrier constitutes a major growth impediment in reaching out to all people and educating all citizens. A vast majority of educational material is available only in English, and state-of-the-art machine translation systems still have not been tailored for this peculiar genre. In addition, a mere collection of appropriate in-domain training material is a challenging task. In this work, we investigate statistical machine translation of lecture subtitles from English into Croatian, which is morphologically rich and generally weakly supported, especially for the educational domain. We show that results comparable with publicly available systems trained on much larger data can be achieved if a small in-domain training set is used in combination with additional in-domain corpus originating from the closely related Serbian language.
- Anthology ID:
- W16-4813
- Volume:
- Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)
- Month:
- December
- Year:
- 2016
- Address:
- Osaka, Japan
- Editors:
- Preslav Nakov, Marcos Zampieri, Liling Tan, Nikola Ljubešić, Jörg Tiedemann, Shervin Malmasi
- Venue:
- VarDial
- Publisher:
- The COLING 2016 Organizing Committee
- Pages:
- 97–105
- URL:
- https://aclanthology.org/W16-4813
- Cite (ACL):
- Maja Popović, Kostadin Cholakov, Valia Kordoni, and Nikola Ljubešić. 2016. Enlarging Scarce In-domain English-Croatian Corpus for SMT of MOOCs Using Serbian. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), pages 97–105, Osaka, Japan. The COLING 2016 Organizing Committee.
- Cite (Informal):
- Enlarging Scarce In-domain English-Croatian Corpus for SMT of MOOCs Using Serbian (Popović et al., VarDial 2016)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-5/W16-4813.pdf