Combining translation and language model scoring for domain-specific data filtering

Saab Mansour, Joern Wuebker, Hermann Ney


Abstract
The increasing popularity of statistical machine translation (SMT) systems is introducing new domains of translation that need to be tackled. As many resources are already available, domain adaptation methods can be applied to utilize these resources in the most beneficial way for the new domain. We explore adaptation via filtering, using cross-entropy scores to discard irrelevant sentences. We focus on filtering for two important components of an SMT system, namely the language model (LM) and the translation model (TM). Previous work has already applied LM cross-entropy based scoring for filtering. We argue that LM cross-entropy might be appropriate for LM filtering, but not as much for TM filtering. We develop a novel filtering approach based on combined TM and LM cross-entropy scores. We experiment with two large-scale translation tasks, the Arabic-to-English and English-to-French IWSLT 2011 TED Talks MT tasks. For LM filtering, we achieve strong perplexity improvements which carry over to the translation quality, with improvements of up to +0.4% BLEU. For TM filtering, the combined method achieves small but consistent improvements over the standalone methods. As a side effect of adaptation via filtering, the vocabulary size and phrase table size of the fully fledged SMT system are reduced by a factor of at least 2, while an improvement of up to +0.6% BLEU is observed.
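The filtering idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes an add-one-smoothed unigram LM trained on a small in-domain sample, scores each candidate sentence by its per-word cross-entropy, and keeps the lowest-scoring sentences. The function names (`train_unigram`, `cross_entropy`, `combined_score`, `filter_corpus`), the linear interpolation, and the `weight` parameter are all hypothetical; the abstract does not specify how the TM and LM scores are combined, and the TM score is treated here as a number supplied from elsewhere.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Return a log-probability function for an add-one-smoothed unigram LM.

    Assumption: a unigram model stands in for whatever LM is actually used.
    """
    counts = Counter(word for s in sentences for word in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen words

    def logprob(word):
        return math.log((counts[word] + 1) / (total + vocab))

    return logprob


def cross_entropy(logprob, sentence):
    """Per-word cross-entropy (in nats) of a sentence under the given model."""
    words = sentence.split()
    if not words:
        return float("inf")  # empty sentences are never preferred
    return -sum(logprob(w) for w in words) / len(words)


def combined_score(lm_score, tm_score, weight=0.5):
    """Hypothetical linear interpolation of an LM score and a TM score."""
    return weight * lm_score + (1.0 - weight) * tm_score


def filter_corpus(pool, score, keep):
    """Keep the `keep` sentences with the lowest score (most in-domain)."""
    return sorted(pool, key=score)[:keep]


# Toy data, invented purely for illustration.
in_domain = ["the talk was about science", "science talk ideas"]
pool = ["a talk about science", "stock prices fell sharply"]

lm = train_unigram(in_domain)
selected = filter_corpus(pool, lambda s: cross_entropy(lm, s), keep=1)
```

In this toy setting the sentence sharing vocabulary with the in-domain sample receives the lower cross-entropy and is the one retained. For TM filtering, a per-sentence-pair translation model score would be passed to `combined_score` alongside the LM score before ranking.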
Anthology ID:
2011.iwslt-papers.5
Volume:
Proceedings of the 8th International Workshop on Spoken Language Translation: Papers
Month:
December 8-9
Year:
2011
Address:
San Francisco, California
Editors:
Marcello Federico, Mei-Yuh Hwang, Margit Rödder, Sebastian Stüker
Venue:
IWSLT
SIG:
SIGSLT
Pages:
222–229
URL:
https://aclanthology.org/2011.iwslt-papers.5
Cite (ACL):
Saab Mansour, Joern Wuebker, and Hermann Ney. 2011. Combining translation and language model scoring for domain-specific data filtering. In Proceedings of the 8th International Workshop on Spoken Language Translation: Papers, pages 222–229, San Francisco, California.
Cite (Informal):
Combining translation and language model scoring for domain-specific data filtering (Mansour et al., IWSLT 2011)
PDF:
https://preview.aclanthology.org/emnlp22-frontmatter/2011.iwslt-papers.5.pdf