PEXACC: A Parallel Sentence Mining Algorithm from Comparable Corpora

Radu Ion


Abstract
Extracting parallel data from comparable corpora to enrich existing statistical translation models is an avenue that has attracted a lot of research in recent years. Experiments convincingly show that parallel data extracted from comparable corpora can improve statistical machine translation. Yet the existing body of research on parallel sentence mining from comparable corpora does not take into account the degree of comparability of the corpus being processed or the computation time it takes to extract parallel sentences from a corpus of a given size. We show that the performance of a parallel sentence extractor crucially depends on the degree of comparability: a weakly comparable corpus is more difficult to process than a strongly comparable one. In this paper we describe PEXACC, a distributed (running on multiple CPUs), trainable parallel sentence/phrase extractor from comparable corpora. PEXACC is freely available for download with the ACCURAT Toolkit, a collection of MT-related tools developed in the ACCURAT project.
Anthology ID:
L12-1193
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
2181–2188
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/382_Paper.pdf
Cite (ACL):
Radu Ion. 2012. PEXACC: A Parallel Sentence Mining Algorithm from Comparable Corpora. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 2181–2188, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
PEXACC: A Parallel Sentence Mining Algorithm from Comparable Corpora (Ion, LREC 2012)
PDF:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/382_Paper.pdf