Data Selection for Bilingual Lexicon Induction from Specialized Comparable Corpora
Martin Laville, Amir Hazem, Emmanuel Morin, Phillippe Langlais
Abstract
Narrow specialized comparable corpora are often small. This makes it difficult to build efficient models for acquiring translation equivalents, especially for less frequent and rare words. One way to overcome this issue is to enrich the specialized corpora with out-of-domain resources. Although some recent studies have shown improvements using data augmentation, the enrichment was conducted coarsely, by adding out-of-domain data with no particular attention given to how to enrich words or how to do so optimally. In this paper, we contrast several data selection techniques to improve bilingual lexicon induction from specialized comparable corpora. We first apply two well-established data selection techniques often used in machine translation, namely Tf-Idf and cross entropy. Then, we propose to exploit BERT for data selection. Overall, all the proposed techniques improve the quality of the extracted bilingual lexicons by a large margin. The best-performing technique is cross entropy, which obtains a gain of about 4 points in MAP while decreasing computation time by a factor of 10.
- Anthology ID:
- 2020.coling-main.527
- Volume:
- Proceedings of the 28th International Conference on Computational Linguistics
- Month:
- December
- Year:
- 2020
- Address:
- Barcelona, Spain (Online)
- Editors:
- Donia Scott, Nuria Bel, Chengqing Zong
- Venue:
- COLING
- Publisher:
- International Committee on Computational Linguistics
- Pages:
- 6002–6012
- URL:
- https://aclanthology.org/2020.coling-main.527
- DOI:
- 10.18653/v1/2020.coling-main.527
- Cite (ACL):
- Martin Laville, Amir Hazem, Emmanuel Morin, and Phillippe Langlais. 2020. Data Selection for Bilingual Lexicon Induction from Specialized Comparable Corpora. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6002–6012, Barcelona, Spain (Online). International Committee on Computational Linguistics.
- Cite (Informal):
- Data Selection for Bilingual Lexicon Induction from Specialized Comparable Corpora (Laville et al., COLING 2020)
- PDF:
- https://preview.aclanthology.org/proper-vol2-ingestion/2020.coling-main.527.pdf