Using Wikipedia to translate domain-specific terms in SMT

Jan Niehues, Alex Waibel


Abstract
When building a university lecture translation system, one important step is adapting it to the target domain. A key problem in this adaptation task is acquiring translations for domain-specific terms. In our approach, we obtain these translations from Wikipedia, which provides articles on very specific topics in many different languages. To extract translations for the domain-specific terms, we use Wikipedia's interlanguage links. We analyze different methods of integrating the resulting corpus into our system and explore methods of disambiguating between different translations by using the text of the articles. In addition, we develop methods to handle different morphological forms of the specific terms in morphologically rich input languages such as German. The results show that the number of out-of-vocabulary (OOV) words could be reduced by 50% on computer science lectures and that translation quality could be improved by more than 1 BLEU point.
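The idea of using interlanguage links as a term lexicon can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy link pairs, the baseline vocabulary and the example sentence are all hypothetical, and the sketch ignores the disambiguation and morphology handling mentioned in the abstract. It only shows how title pairs linked across languages could be collected into a lexicon and how adding them changes the OOV rate of an input sentence.

```python
from typing import Dict, Iterable, List, Set, Tuple


def build_term_lexicon(links: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each lowercased source-language title to its linked target titles."""
    lexicon: Dict[str, List[str]] = {}
    for source_title, target_title in links:
        targets = lexicon.setdefault(source_title.lower(), [])
        if target_title not in targets:  # keep each candidate translation once
            targets.append(target_title)
    return lexicon


def oov_rate(tokens: List[str], vocabulary: Set[str]) -> float:
    """Fraction of tokens whose lowercased form is not in the vocabulary."""
    if not tokens:
        return 0.0
    unknown = sum(1 for token in tokens if token.lower() not in vocabulary)
    return unknown / len(tokens)


# Hypothetical German-English title pairs, standing in for interlanguage links.
links = [
    ("Warteschlange", "Queue"),
    ("Hashtabelle", "Hash table"),
    ("Compiler", "Compiler"),
]
lexicon = build_term_lexicon(links)

# Hypothetical vocabulary of a baseline system trained on general-domain data.
baseline_vocab = {"die", "ist", "eine", "und"}
sentence = "Die Warteschlange ist eine Datenstruktur".split()

before = oov_rate(sentence, baseline_vocab)
after = oov_rate(sentence, baseline_vocab | set(lexicon))
print(f"OOV rate before: {before:.2f}, after: {after:.2f}")
```

In this toy setting only exact title matches are covered, so an inflected form such as a plural would still count as OOV; that gap is what the morphological methods described in the abstract are meant to address.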
Anthology ID:
2011.iwslt-papers.6
Volume:
Proceedings of the 8th International Workshop on Spoken Language Translation: Papers
Month:
December 8-9
Year:
2011
Address:
San Francisco, California
Editors:
Marcello Federico, Mei-Yuh Hwang, Margit Rödder, Sebastian Stüker
Venue:
IWSLT
SIG:
SIGSLT
Pages:
230–237
URL:
https://aclanthology.org/2011.iwslt-papers.6
Cite (ACL):
Jan Niehues and Alex Waibel. 2011. Using Wikipedia to translate domain-specific terms in SMT. In Proceedings of the 8th International Workshop on Spoken Language Translation: Papers, pages 230–237, San Francisco, California.
Cite (Informal):
Using Wikipedia to translate domain-specific terms in SMT (Niehues & Waibel, IWSLT 2011)
PDF:
https://preview.aclanthology.org/emnlp-22-attachments/2011.iwslt-papers.6.pdf