A Holistic Approach to Bilingual Sentence Fragment Extraction from Comparable Corpora
Mahdi Khademian, Kaveh Taghipour, Saab Mansour, Shahram Khadivi
Abstract
Accurate translation with statistical machine translation systems, especially for multi-domain documents, requires ever larger amounts of bilingual text, and this need is most acute when training such systems for language pairs with scarce training data. In recent years, some research has explored a new source of parallel text: documents that are not necessarily parallel but are comparable. Because these methods search for possible translation equivalences greedily, they cannot consider all possible parallel texts in comparable documents. This paper investigates a different approach that considers the relationships between all words of two comparable documents and works fairly well even in the worst case of comparability. We represent each document pair as a matrix and then transform it into a new space to find parallel fragments. Evaluations show that the system successfully extracts useful fragment pairs.
- Anthology ID:
- L12-1531
- Volume:
- Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
- Month:
- May
- Year:
- 2012
- Address:
- Istanbul, Turkey
- Editors:
- Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
- Venue:
- LREC
- Publisher:
- European Language Resources Association (ELRA)
- Pages:
- 4073–4079
- URL:
- http://www.lrec-conf.org/proceedings/lrec2012/pdf/892_Paper.pdf
- Cite (ACL):
- Mahdi Khademian, Kaveh Taghipour, Saab Mansour, and Shahram Khadivi. 2012. A Holistic Approach to Bilingual Sentence Fragment Extraction from Comparable Corpora. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 4073–4079, Istanbul, Turkey. European Language Resources Association (ELRA).
- Cite (Informal):
- A Holistic Approach to Bilingual Sentence Fragment Extraction from Comparable Corpora (Khademian et al., LREC 2012)