Abstract
Extracting parallel data from comparable corpora in order to enrich existing statistical translation models is an avenue that has attracted a lot of research in recent years. Experiments have convincingly shown that parallel data extracted from comparable corpora can improve statistical machine translation. Yet the existing body of research on parallel sentence mining from comparable corpora does not take into account the degree of comparability of the corpus being processed, or the computation time needed to extract parallel sentences from a corpus of a given size. We show that the performance of a parallel sentence extractor depends crucially on the degree of comparability: a weakly comparable corpus is more difficult to process than a strongly comparable one. In this paper we describe PEXACC, a distributed (running on multiple CPUs), trainable parallel sentence/phrase extractor from comparable corpora. PEXACC is freely available for download with the ACCURAT Toolkit, a collection of MT-related tools developed in the ACCURAT project.
- Anthology ID:
- L12-1193
- Volume:
- Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
- Month:
- May
- Year:
- 2012
- Address:
- Istanbul, Turkey
- Editors:
- Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
- Venue:
- LREC
- Publisher:
- European Language Resources Association (ELRA)
- Pages:
- 2181–2188
- URL:
- http://www.lrec-conf.org/proceedings/lrec2012/pdf/382_Paper.pdf
- Cite (ACL):
- Radu Ion. 2012. PEXACC: A Parallel Sentence Mining Algorithm from Comparable Corpora. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 2181–2188, Istanbul, Turkey. European Language Resources Association (ELRA).
- Cite (Informal):
- PEXACC: A Parallel Sentence Mining Algorithm from Comparable Corpora (Ion, LREC 2012)