Low Cost Portability for Statistical Machine Translation based on N-gram Coverage

Matthias Eck, Stephan Vogel, Alex Waibel


Abstract
Statistical machine translation relies heavily on the available training data. In some cases, however, the amount of training data that can be created for, or actually used by, a system must be limited. To address this, we introduce a weighting scheme that tries to select the more informative sentences first. The selection is based on the previously unseen n-grams that each sentence contains, and it allows us to sort the sentences by their estimated importance. After sorting, we can construct smaller training corpora, and we demonstrate that systems trained on much less data perform very competitively with baseline systems that use all available training data.
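
The selection idea in the abstract can be illustrated with a small sketch. The abstract does not give the paper's actual weighting formula, so the scoring here is an assumption: each sentence is scored by the raw count of n-grams (up to a hypothetical `max_n`) that no previously selected sentence has covered, and the highest-scoring sentence is chosen greedily at each step. The function names `ngrams` and `sort_by_unseen_ngrams` and the whitespace tokenization are also illustrative choices, not taken from the paper.

```python
from typing import List, Set, Tuple


def ngrams(tokens: List[str], max_n: int) -> Set[Tuple[str, ...]]:
    """Return the set of all n-grams of order 1..max_n in a token list."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def sort_by_unseen_ngrams(sentences: List[str], max_n: int = 2) -> List[str]:
    """Greedily order sentences by the number of new n-grams each one adds.

    Assumption: the score is the plain count of n-grams not yet covered by
    the sentences selected so far. The paper's weighting may differ.
    """
    # Pre-compute each sentence's n-gram set (whitespace tokenization assumed).
    remaining = [(s, ngrams(s.split(), max_n)) for s in sentences]
    seen: Set[Tuple[str, ...]] = set()
    ordered: List[str] = []

    while remaining:
        # max() returns the first maximal element, so ties keep corpus order.
        best = max(range(len(remaining)),
                   key=lambda i: len(remaining[i][1] - seen))
        sentence, grams = remaining.pop(best)
        ordered.append(sentence)
        seen |= grams  # these n-grams no longer count as unseen

    return ordered


if __name__ == "__main__":
    corpus = ["a b", "a b c d", "a b", "e"]
    for sentence in sort_by_unseen_ngrams(corpus, max_n=2):
        print(sentence)
```

A smaller training corpus would then be a prefix of the sorted list, for example `sort_by_unseen_ngrams(corpus)[:k]` for some budget `k`. This greedy loop rescans all remaining sentences at every step, which is quadratic in corpus size and only meant to make the idea concrete.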
Anthology ID:
2005.mtsummit-papers.30
Volume:
Proceedings of Machine Translation Summit X: Papers
Month:
September 13-15
Year:
2005
Address:
Phuket, Thailand
Venue:
MTSummit
Pages:
227–234
URL:
https://aclanthology.org/2005.mtsummit-papers.30
Cite (ACL):
Matthias Eck, Stephan Vogel, and Alex Waibel. 2005. Low Cost Portability for Statistical Machine Translation based on N-gram Coverage. In Proceedings of Machine Translation Summit X: Papers, pages 227–234, Phuket, Thailand.
Cite (Informal):
Low Cost Portability for Statistical Machine Translation based on N-gram Coverage (Eck et al., MTSummit 2005)
PDF:
https://preview.aclanthology.org/ingest-acl-2023-videos/2005.mtsummit-papers.30.pdf