Aligning Word Vectors on Low-Resource Languages with Wiktionary

Mike Izbicki


Abstract
Aligned word embeddings have become a popular technique for low-resource natural language processing. Most existing evaluation datasets are generated automatically from machine translation systems, so they contain many errors and exist only for high-resource languages. We introduce the Wiktionary bilingual lexicon collection, which provides high-quality, human-annotated translations into English for words in 298 languages. We use these lexicons to train and evaluate the largest published collection of aligned word embeddings, covering 157 different languages. All of our code and data are publicly available at https://github.com/mikeizbicki/wiktionary_bli.
Anthology ID:
2022.loresmt-1.14
Volume:
Proceedings of the Fifth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2022)
Month:
October
Year:
2022
Address:
Gyeongju, Republic of Korea
Venue:
LoResMT
Publisher:
Association for Computational Linguistics
Pages:
107–117
URL:
https://aclanthology.org/2022.loresmt-1.14
Cite (ACL):
Mike Izbicki. 2022. Aligning Word Vectors on Low-Resource Languages with Wiktionary. In Proceedings of the Fifth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2022), pages 107–117, Gyeongju, Republic of Korea. Association for Computational Linguistics.
Cite (Informal):
Aligning Word Vectors on Low-Resource Languages with Wiktionary (Izbicki, LoResMT 2022)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2022.loresmt-1.14.pdf
Code:
mikeizbicki/wiktionary_bli