Million Meshesha


2020

pdf
Large Vocabulary Read Speech Corpora for Four Ethiopian Languages: Amharic, Tigrigna, Oromo and Wolaytta
Solomon Teferra Abate | Martha Yifiru Tachbelie | Michael Melese | Hafte Abera | Tewodros Abebe | Wondwossen Mulugeta | Yaregal Assabie | Million Meshesha | Solomon Afnafu | Binyam Ephrem Seyoum
Proceedings of the Twelfth Language Resources and Evaluation Conference

Automatic Speech Recognition (ASR) is one of the most important technologies to support spoken communication in modern life. However, its development benefits from large speech corpus. The development of such a corpus is expensive and most of the human languages, including the Ethiopian languages, do not have such resources. To address this problem, we have developed four large (about 22 hours) speech corpora for four Ethiopian languages: Amharic, Tigrigna, Oromo and Wolaytta. To assess usability of the corpora for (the purpose of) speech processing, we have developed ASR systems for each language. In this paper, we present the corpora and the baseline ASR systems we have developed. We have achieved word error rates (WERs) of 37.65%, 31.03%, 38.02%, 33.89% for Amharic, Tigrigna, Oromo and Wolaytta, respectively. This results show that the corpora are suitable for further investigation towards the development of ASR systems. Thus, the research community can use the corpora to further improve speech processing systems. From our results, it is clear that the collection of text corpora to train strong language models for all of the languages is still required, especially for Oromo and Wolaytta.

2019


English-Ethiopian Languages Statistical Machine Translation
Solomon Teferra Abate | Michael Melese | Martha Yifiru Tachbelie | Million Meshesha | Solomon Atinafu | Wondwossen Mulugeta | Yaregal Assabie | Hafte Abera | Biniyam Ephrem | Tewodros Gebreselassie | Wondimagegnhue Tsegaye Tufa | Amanuel Lemma | Tsegaye Andargie | Seifedin Shifaw
Proceedings of the 2019 Workshop on Widening NLP

In this paper, we describe an attempt towards the development of parallel corpora for English and Ethiopian Languages, such as Amharic, Tigrigna, Afan-Oromo, Wolaytta and Ge’ez. The corpora are used for conducting bi-directional SMT experiments. The BLEU scores of the bi-directional SMT systems show a promising result. The morphological richness of the Ethiopian languages has a great impact on the performance of SMT especially when the targets are Ethiopian languages.

2018

pdf
Parallel Corpora for bi-Directional Statistical Machine Translation for Seven Ethiopian Language Pairs
Solomon Teferra Abate | Michael Melese | Martha Yifiru Tachbelie | Million Meshesha | Solomon Atinafu | Wondwossen Mulugeta | Yaregal Assabie | Hafte Abera | Binyam Ephrem | Tewodros Abebe | Wondimagegnhue Tsegaye | Amanuel Lemma | Tsegaye Andargie | Seifedin Shifaw
Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing

In this paper, we describe the development of parallel corpora for Ethiopian Languages: Amharic, Tigrigna, Afan-Oromo, Wolaytta and Geez. To check the usability of all the corpora we conducted baseline bi-directional statistical machine translation (SMT) experiments for seven language pairs. The performance of the bi-directional SMT systems shows that all the corpora can be used for further investigations. We have also shown that the morphological complexity of the Ethio-Semitic languages has a negative impact on the performance of the SMT especially when they are target languages. Based on the results we obtained, we are currently working towards handling the morphological complexities to improve the performance of statistical machine translation among the Ethiopian languages.

pdf
Parallel Corpora for bi-lingual English-Ethiopian Languages Statistical Machine Translation
Solomon Teferra Abate | Michael Melese | Martha Yifiru Tachbelie | Million Meshesha | Solomon Atinafu | Wondwossen Mulugeta | Yaregal Assabie | Hafte Abera | Binyam Ephrem | Tewodros Abebe | Wondimagegnhue Tsegaye | Amanuel Lemma | Tsegaye Andargie | Seifedin Shifaw
Proceedings of the 27th International Conference on Computational Linguistics

In this paper, we describe an attempt towards the development of parallel corpora for English and Ethiopian Languages, such as Amharic, Tigrigna, Afan-Oromo, Wolaytta and Ge’ez. The corpora are used for conducting a bi-directional statistical machine translation experiments. The BLEU scores of the bi-directional Statistical Machine Translation (SMT) systems show a promising result. The morphological richness of the Ethiopian languages has a great impact on the performance of SMT specially when the targets are Ethiopian languages. Now we are working towards an optimal alignment for a bi-directional English-Ethiopian languages SMT.

2017

pdf
Amharic-English Speech Translation in Tourism Domain
Michael Melese | Laurent Besacier | Million Meshesha
Proceedings of the Workshop on Speech-Centric Natural Language Processing

This paper describes speech translation from Amharic-to-English, particularly Automatic Speech Recognition (ASR) with post-editing feature and Amharic-English Statistical Machine Translation (SMT). ASR experiment is conducted using morpheme language model (LM) and phoneme acoustic model(AM). Likewise,SMT conducted using word and morpheme as unit. Morpheme based translation shows a 6.29 BLEU score at a 76.4% of recognition accuracy while word based translation shows a 12.83 BLEU score using 77.4% word recognition accuracy. Further, after post-edit on Amharic ASR using corpus based n-gram, the word recognition accuracy increased by 1.42%. Since post-edit approach reduces error propagation, the word based translation accuracy improved by 0.25 (1.95%) BLEU score. We are now working towards further improving propagated errors through different algorithms at each unit of speech translation cascading component.

2012

pdf
Constructing Reference Semantic Predictions from Biomedical Knowledge Sources
Demeke Ayele | Jean-Pierre Chevallet | Million Meshesha | Getnet Kassie
Proceedings of COLING 2012