Deriving translation units using small additional corpora

Carlos A. Henríquez Q., José B. Mariño, Rafael E. Banchs


Abstract
We present a novel strategy to derive new translation units using an additional bilingual corpus and a previously trained SMT system. The derived units are then used to adapt the SMT system. The derivation process can be applied when the additional corpus is very small compared with the original training corpus, and it does not require computing new word alignments over all corpora. The strategy is based on the Levenshtein distance and its resulting path. We report a statistically significant improvement, at a 99% confidence level, when adapting an Ngram-based Catalan-Spanish system using an additional corpus that represents less than 0.5% of the original training corpus. The additional translation units solved morphological and lexical errors and added previously unknown words to the vocabulary.
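The abstract names the Levenshtein distance and its resulting path as the basis of the strategy, but does not spell out which sequences are compared or how units are extracted from the path. As a minimal illustration only, the sketch below computes a word-level Levenshtein distance between two token sequences and backtraces one optimal edit path. The function name `levenshtein_path`, the operation labels, and the example sentences are assumptions made for this sketch, not the authors' implementation.

```python
from typing import List, Tuple

# One edit step: (operation, source token, target token).
# The operation labels are illustrative names chosen for this sketch.
EditStep = Tuple[str, str, str]


def levenshtein_path(src: List[str], tgt: List[str]) -> Tuple[int, List[EditStep]]:
    """Return the word-level Levenshtein distance and one optimal edit path."""
    n, m = len(src), len(tgt)

    # dist[i][j] = edit distance between src[:i] and tgt[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if src[i - 1] == tgt[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete a source token
                dist[i][j - 1] + 1,         # insert a target token
                dist[i - 1][j - 1] + cost,  # match or substitute
            )

    # Backtrace from the bottom-right cell to recover one optimal path.
    path: List[EditStep] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if src[i - 1] == tgt[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                op = "match" if cost == 0 else "substitute"
                path.append((op, src[i - 1], tgt[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            path.append(("delete", src[i - 1], ""))
            i -= 1
        else:
            path.append(("insert", "", tgt[j - 1]))
            j -= 1

    path.reverse()
    return dist[n][m], path


if __name__ == "__main__":
    # Hypothetical token sequences, used only to exercise the function.
    distance, steps = levenshtein_path("the cat sat".split(), "the dog sat".split())
    print(distance)
    for step in steps:
        print(step)
```

In a path like this, the non-matching steps (substitutions, insertions, deletions) mark where the two sequences differ, which is the kind of information an edit path makes available.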
Anthology ID:
2011.eamt-1.18
Volume:
Proceedings of the 15th Annual Conference of the European Association for Machine Translation
Month:
May 30–31
Year:
2011
Address:
Leuven, Belgium
Editors:
Mikel L. Forcada, Heidi Depraetere, Vincent Vandeghinste
Venue:
EAMT
Publisher:
European Association for Machine Translation
URL:
https://preview.aclanthology.org/bulk-corrections-2025-11-25/2011.eamt-1.18/
Cite (ACL):
Carlos A. Henríquez Q., José B. Mariño, and Rafael E. Banchs. 2011. Deriving translation units using small additional corpora. In Proceedings of the 15th Annual Conference of the European Association for Machine Translation, Leuven, Belgium. European Association for Machine Translation.
Cite (Informal):
Deriving translation units using small additional corpora (Henríquez Q. et al., EAMT 2011)
PDF:
https://preview.aclanthology.org/bulk-corrections-2025-11-25/2011.eamt-1.18.pdf