Dealing with unknown words in statistical machine translation

João Silva, Luísa Coheur, Ângela Costa, Isabel Trancoso


Abstract
In Statistical Machine Translation, words that were not seen during training are unknown words, that is, words that the system does not know how to translate. In this paper we address this problem by exploiting the orthographic cues that words provide. We report a study of the impact of word distance metrics on cognate detection and, in addition, of the possibility of obtaining candidate translations of unknown words through Logical Analogy. Our approach is tested on the translation of corpora from Portuguese to English (and vice versa).
Anthology ID:
L12-1585
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
3911–3981
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/980_Paper.pdf
Cite (ACL):
João Silva, Luísa Coheur, Ângela Costa, and Isabel Trancoso. 2012. Dealing with unknown words in statistical machine translation. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 3911–3981, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
Dealing with unknown words in statistical machine translation (Silva et al., LREC 2012)