AGE: Amharic, Ge’ez and English Parallel Dataset

Henok Ademtew; Mikiyas Birbo

Abstract

African languages are not well represented in Natural Language Processing (NLP), mainly because of a lack of resources for training models. Low-resource languages such as Amharic and Ge’ez cannot benefit from modern NLP methods without high-quality datasets. This paper presents AGE, an open-source, three-way aligned parallel dataset of Amharic, Ge’ez, and English. Additionally, we introduce 1,000 new Ge’ez-centered sentences sourced from domains such as news and novels. Furthermore, we developed a model from a multilingual pre-trained language model, which achieves scores of 12.29 and 30.66 for English-Ge’ez and Ge’ez-English, respectively, and 9.39 and 12.29 for Amharic-Ge’ez and Ge’ez-Amharic, respectively.

Anthology ID: 2024.loresmt-1.14
Volume: Proceedings of the The Seventh Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2024)
Month: August
Year: 2024
Address: Bangkok, Thailand
Editors: Atul Kr. Ojha, Chao-hong Liu, Ekaterina Vylomova, Flammie Pirinen, Jade Abbott, Jonathan Washington, Nathaniel Oco, Valentin Malykh, Varvara Logacheva, Xiaobing Zhao
Venues: LoResMT | WS
Publisher: Association for Computational Linguistics
Pages: 139–145
URL: https://aclanthology.org/2024.loresmt-1.14
Cite (ACL): Henok Ademtew and Mikiyas Birbo. 2024. AGE: Amharic, Ge’ez and English Parallel Dataset. In Proceedings of the The Seventh Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2024), pages 139–145, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal): AGE: Amharic, Ge’ez and English Parallel Dataset (Ademtew & Birbo, LoResMT-WS 2024)
PDF: https://preview.aclanthology.org/nschneid-patch-4/2024.loresmt-1.14.pdf
