Compiling a Highly Accurate Bilingual Lexicon by Combining Different Approaches

Steinþór Steingrímsson, Luke O’Brien, Finnur Ingimundarson, Hrafn Loftsson, Andy Way


Abstract
Bilingual lexicons can be generated automatically using a wide variety of approaches. We perform a rigorous manual evaluation of four different methods: word alignments on different types of bilingual data, pivoting, machine translation and cross-lingual word embeddings. We investigate how the different setups perform using publicly available data for the English-Icelandic language pair, with separate evaluations for each method, dataset and, where it can be calculated, confidence class. The results are validated by human experts, working with a random sample from all our experiments. By combining the most promising approaches and datasets, using confidence scores calculated from the data and the results of our manual evaluation of samples as indicators, we are able to induce lists of translations with a very high acceptance rate. We show how multiple different combinations generate lists with acceptance rates well over 90%, substantially exceeding the results for each individual approach, while still generating reasonably large candidate lists. All manually evaluated equivalence pairs are published in a new lexicon of over 232,000 pairs under an open license.
Anthology ID:
2022.gwll-1.6
Volume:
Proceedings of Globalex Workshop on Linked Lexicography within the 13th Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Ilan Kernerman, Simon Krek
Venue:
gwll
Publisher:
European Language Resources Association
Pages:
32–41
URL:
https://preview.aclanthology.org/sigedu-bea-out-of-sync-correction/2022.gwll-1.6/
Cite (ACL):
Steinþór Steingrímsson, Luke O’Brien, Finnur Ingimundarson, Hrafn Loftsson, and Andy Way. 2022. Compiling a Highly Accurate Bilingual Lexicon by Combining Different Approaches. In Proceedings of Globalex Workshop on Linked Lexicography within the 13th Language Resources and Evaluation Conference, pages 32–41, Marseille, France. European Language Resources Association.
Cite (Informal):
Compiling a Highly Accurate Bilingual Lexicon by Combining Different Approaches (Steingrímsson et al., gwll 2022)
PDF:
https://preview.aclanthology.org/sigedu-bea-out-of-sync-correction/2022.gwll-1.6.pdf