Mateja Verlič

Also published as: Mateja Verlic

2013

Application of Localized Similarity for Web Documents
Peter Reberšek | Mateja Verlič
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

2012

Lack of sufficient parallel data for many languages and domains is currently one of the major obstacles to further advancement of automated translation. The ACCURAT project is addressing this issue by researching methods for improving machine translation systems with comparable corpora. In this paper we present tools and techniques developed in the ACCURAT project that allow the additional data needed for statistical machine translation to be extracted from comparable corpora. We present methods and tools for acquiring comparable corpora from the Web and other sources, for evaluating the comparability of the collected corpora, for multi-level alignment of comparable corpora, and for extracting lexical and terminological data for machine translation. Finally, we present initial evaluation results on the utility of the collected corpora in domain-adapted machine translation and real-life applications.

Co-authors

Monica Lestari Paramita 1

Mārcis Pinnis 1

Venues

emnlp 1
lrec 1