Widening Natural Language Processing (2021)

Volumes

Proceedings of the Fifth Workshop on Widening Natural Language Processing 6 papers

bib (full) Proceedings of the Fifth Workshop on Widening Natural Language Processing

bib abs
OkwuGbé: End-to-End Speech Recognition for Fon and Igbo
Bonaventure F. P. Dossou | Chris Chinenye Emezue

Language is a fundamental component of human communication. African low-resourced languages have recently been a major subject of research in machine translation, and other text-based areas of NLP. However, there is still very little comparable research in speech recognition for African languages. OkwuGbé is a step towards building speech recognition systems for African low-resourced languages. Using Fon and Igbo as our case study, we build two end-to-end deep neural network-based speech recognition models. We present a state-of-the-art automatic speech recognition (ASR) model for Fon, and a benchmark ASR model result for Igbo. Our findings serve both as a guide for future NLP research for Fon and Igbo in particular, and the creation of speech recognition models for other African low-resourced languages in general. The Fon and Igbo models source code have been made publicly available. Moreover, Okwugbe, a python library has been created to make easier the process of ASR model building and training.

bib abs
TEET! Tunisian Dataset for Toxic Speech Detection
Slim Gharbi | Hatem Haddad | Mayssa Kchaou | Heger Arfaoui

The complete freedom of expression in social media has its costs especially in spreading harmful and abusive content that may induce people to act accordingly. Therefore, the need of detecting automatically such a content becomes an urgent task that will help and enhance the efficiency in limiting this toxic spread. Compared to other Arabic dialects which are mostly based on MSA, the Tunisian dialect is a combination of many other languages like MSA, Tamazight, Italian and French. Because of its rich language, dealing with NLP problems can be challenging due to the lack of large annotated datasets. In our context of detecting hate and abusive speech for tunisian dialect, the only existing annotated dataset is T-HSAB combining 6,039 annotated comments as hateful, abusive or normal. In this paper we are introducing a larger annotated dataset composed of approximately 10k of comments. We provide an in-depth exploration of its vocabulary as well as the results of the classification performance of machine learning classifiers like NB and SVM and deep learning models such as ARBERT, MARBERT and XLM-R.

abs
Developing Keyboards for the Endangered Livonian Language
Mika Hämäläinen | Khalid Alnajjar

We present our current work on developing keyboard layouts for a critically endangered Uralic language called Livonian. Our layouts work on Windows, MacOS and Linux. In addition, we have developed keyboard apps with predictive text for Android and iOS. This work has been conducted in collaboration with the language community.

In the current study, we analyzed 15297 texts from 39 cancer survivors who posted or commented on Reddit in order to detect the language particularities of cancer survivors from online discourse. We performed a computational linguistic analysis (part-of-speech analysis, emoji detection, sentiment analysis) on submissions around the time of the cancer diagnosis and around the time of remission. We found several significant differences in the texts posted around the time of remission compared to those around the time of diagnosis. Though our results need to be backed up by a higher corpus of data, they do cue to the fact that cancer survivors, around the time of remission, focus more on others, are more active on social media, and do not see the glass as half empty as suggested by the valence of the emojis.

abs
The Development of Pre-processing Tools and Pre-trained Embedding Models for Amharic
Tadesse Destaw | Abinew Ayele | Seid Muhie Yimam

Amharic is the second most spoken Semitic language after Arabic and serves as the official working language of Ethiopia. While Amharic NLP research is getting wider attention recently, the main bottleneck is that the resources and related tools are not publicly released, which makes it still a low-resource language. Due to this reason, we observe that different researchers try to repeat the same NLP research again and again. In this work, we investigate the existing approach in Amharic NLP and take the first step to publicly release tools, datasets, and models to advance Amharic NLP research. We build Python-based preprocessing tools for Amharic (tokenizer, sentence segmenter, and text cleaner) that can easily be used and integrated for the development of NLP applications. Furthermore, we compiled the first moderately large-scale Amharic text corpus (6.8m sentences) along with the word2Vec, fastText, RoBERTa, and FLAIR embeddings models. Finally, we compile benchmark datasets and build classification models for the named entity recognition task.