Abstract
Aligned word embeddings have become a popular technique for low-resource natural language processing. Most existing evaluation datasets are generated automatically from machine translation systems, so they contain many errors and exist only for high-resource languages. We introduce the Wiktionary bilingual lexicon collection, which provides high-quality human-annotated translations of words from 298 languages into English. We use these lexicons to train and evaluate the largest published collection of aligned word embeddings on 157 different languages. All of our code and data are publicly available at https://github.com/mikeizbicki/wiktionary_bli.
- Anthology ID:
- 2022.loresmt-1.14
- Volume:
- Proceedings of the Fifth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2022)
- Month:
- October
- Year:
- 2022
- Address:
- Gyeongju, Republic of Korea
- Venue:
- LoResMT
- Publisher:
- Association for Computational Linguistics
- Pages:
- 107–117
- URL:
- https://aclanthology.org/2022.loresmt-1.14
- Cite (ACL):
- Mike Izbicki. 2022. Aligning Word Vectors on Low-Resource Languages with Wiktionary. In Proceedings of the Fifth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2022), pages 107–117, Gyeongju, Republic of Korea. Association for Computational Linguistics.
- Cite (Informal):
- Aligning Word Vectors on Low-Resource Languages with Wiktionary (Izbicki, LoResMT 2022)
- PDF:
- https://preview.aclanthology.org/paclic-22-ingestion/2022.loresmt-1.14.pdf
- Code:
- mikeizbicki/wiktionary_bli