Yonas Woldemariam


2020

Transfer Learning for Less-Resourced Semitic Languages Speech Recognition: the Case of Amharic
Yonas Woldemariam
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

Building automatic speech recognition (ASR) systems requires large amounts of speech and text data, a requirement that is especially hard to meet for less-resourced languages. In this paper, we investigate a model adaptation method, namely transfer learning, for a less-resourced Semitic language, Amharic, to address resource scarcity in speech recognition development and improve the Amharic ASR model. In our experiments, we transfer acoustic models trained on two different source languages (English and Mandarin) to Amharic using very limited resources. The experimental results show that a significant word error rate (WER) reduction is achieved by transferring the hidden layers of the neural networks trained on the source languages. In the best case, the Amharic ASR model adapted from English reduces WER from 38.72% to 24.50% (an absolute improvement of 14.22 percentage points). Adapting the Mandarin model improves the baseline Amharic model with an absolute WER reduction of 10.25 percentage points. Our analysis also reveals that the speech recognition performance of the adapted acoustic model is influenced more by the relatedness (in a relative sense) between the source and target languages than by the other factors considered (e.g., the quality of the source models). Furthermore, other Semitic as well as Afro-Asiatic languages could benefit from the methodology presented in this study.

Adapting Language Specific Components of Cross-Media Analysis Frameworks to Less-Resourced Languages: the Case of Amharic
Yonas Woldemariam | Adam Dahlgren
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

We present an ASR-based pipeline for Amharic that orchestrates NLP components within a cross-media analysis framework (CMAF). One of the major challenges inherent to CMAFs is addressing multilingual issues effectively. As a result, many languages remain under-resourced and fail to benefit from available media analysis solutions. Although Amharic is spoken natively by over 22 million people and there is an ever-increasing amount of Amharic multimedia content on the Web, querying this content with simple text search is difficult. Searching audio and video content with simple keywords is even harder, because such content exists only in raw form. In this study, we introduce a spoken and textual content processing workflow into a CMAF for Amharic. We design an ASR-named entity recognition (NER) pipeline that includes three main components: ASR, a transliterator, and NER. We explore various acoustic modeling techniques and develop an OpenNLP-based NER extractor along with a transliterator that interfaces between ASR and NER. The designed ASR-NER pipeline for Amharic promotes the multilingual support of CMAFs. The state-of-the-art design principles and techniques employed in this study also offer guidance for other less-resourced languages, particularly the Semitic ones.

Assessing Users’ Reputation from Syntactic and Semantic Information in Community Question Answering
Yonas Woldemariam
Proceedings of the Twelfth Language Resources and Evaluation Conference

Textual content is the most significant, and by far the largest, part of community question answering (CQA) forums. Users gain reputation for contributing such content. Although linguistic quality is the very essence of textual information, it does not seem to be considered in estimating users’ reputation. As existing user reputation systems seem to rely solely on vote counting, adding linguistic information should improve their quality. In this study, we investigate the relationship between users’ reputation and linguistic features extracted from the content of their answers. We build statistical models on a Stack Overflow dataset that learn reputation from complex syntactic and semantic structures of such content. The resulting models reveal how users’ writing styles in answering questions play important roles in building reputation points. In our experiments, we extract answers from systematically selected users, annotate them with linguistic features, and build models. The models are evaluated on in-domain (e.g., Server Fault, Super User) and out-of-domain (e.g., English, Maths) datasets. We find that the selected linguistic features have a significant influence on reputation scores. In the best case, the selected linguistic feature set explains 80% of the variation in reputation scores, with a prediction error of 3%. The performance of the baseline models is significantly improved by adding syntactic and punctuation features.

2017

Predicting User Competence from Linguistic Data
Yonas Woldemariam | Henrik Björklund | Suna Bensch
Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)