Proceedings of the Fourth workshop on Resources for African Indigenous Languages (RAIL 2023)

Rooweither Mabuya, Don Mthobela, Mmasibidi Setaka, Menno Van Zaanen (Editors)

Anthology ID:: 2023.rail-1
Month:: May
Year:: 2023
Address:: Dubrovnik, Croatia
Venue:: RAIL
SIG:
Publisher:: Association for Computational Linguistics
URL:: https://aclanthology.org/2023.rail-1
DOI:
Bib Export formats:: BibTeX
PDF:: https://preview.aclanthology.org/naacl24-info/2023.rail-1.pdf

pdf bib
Proceedings of the Fourth workshop on Resources for African Indigenous Languages (RAIL 2023)
Rooweither Mabuya | Don Mthobela | Mmasibidi Setaka | Menno Van Zaanen

pdf bib abs
Automatic Spell Checker and Correction for Under-represented Spoken Languages: Case Study on Wolof
Thierno Ibrahima Cissé | Fatiha Sadat

This paper presents a spell checker and correction tool specifically designed for Wolof, an under-represented spoken language in Africa. The proposed spell checker leverages a combination of a trie data structure, dynamic programming, and the weighted Levenshtein distance to generate suggestions for misspelled words. We created novel linguistic resources for Wolof, such as a lexicon and a corpus of misspelled words, using a semi-automatic approach that combines manual and automatic annotation methods. Despite the limited data available for the Wolof language, the spell checker’s performance showed a predictive accuracy of 98.31% and a suggestion accuracy of 93.33%.Our primary focus remains the revitalization and preservation of Wolof as an Indigenous and spoken language in Africa, providing our efforts to develop novel linguistic resources. This work represents a valuable contribution to the growth of computational tools and resources for the Wolof language and provides a strong foundation for future studies in the automatic spell checking and correction field.

pdf bib abs
Unsupervised Cross-lingual Word Embedding Representation for English-isiZulu
Derwin Ngomane | Rooweither Mabuya | Jade Abbott | Vukosi Marivate

In this study, we investigate the effectiveness of using cross-lingual word embeddings for zero-shot transfer learning between a language with an abundant resource, English, and a languagewith limited resource, isiZulu. IsiZulu is a part of the South African Nguni language family, which is characterised by complex agglutinating morphology. We use VecMap, an open source tool, to obtain cross-lingual word embeddings. To perform an extrinsic evaluation of the effectiveness of the embeddings, we train a news classifier on labelled English data in order to categorise unlabelled isiZulu data using zero-shot transfer learning. In our study, we found our model to have a weighted average F1-score of 0.34. Our findings demonstrate that VecMap generates modular word embeddings in the cross-lingual space that have an impact on the downstream classifier used for zero-shot transfer learning.

pdf abs
Preparing the Vuk’uzenzele and ZA-gov-multilingual South African multilingual corpora
Richard Lastrucci | Jenalea Rajab | Matimba Shingange | Daniel Njini | Vukosi Marivate

This paper introduces two multilingual government themed corpora in various South African languages. The corpora were collected by gathering South African government speeches (ZA-gov-multilingual), as well as the South African Government newspaper (Vuk’uzenzele), that are translated into all 11 South African official languages. The corpora can be used for a myriad of downstream NLP tasks. The corpora were created to allow researchers to study the language used in South African government publications, with a focus on understanding how South African government officials communicate with their constituents. In this paper we highlight the process of gathering, cleaning and making available the corpora. We create parallel sentence corpora for Neural Machine Translation tasks using Language-Agnostic Sentence Representations (LASER) embeddings. With these aligned sentences we then provide NMT benchmarks for 9 indigenous languages by fine-tuning massively multilingual pre-trained language model.

pdf abs
SpeechReporting Corpus: annotated corpora of West African traditional narratives
Ekaterina Aplonova | Izabela Jordanoska | Timofey Arkhangelskiy | Tatiana Nikitina

This paper describes the SpeechReporting database, an online collection of corpora annotated for a range of discourse phenomena. The corpora contain folktales from 7 lesser-studied West African languages. Apart from its value for theoretical linguistics, especially for the study of reported speech, the database is an important resource for the preservation of intangible cultural heritage of minority languages and the development and testing of cross-linguistically applicable computational tools.

pdf abs
A Corpus-Based List of Frequently Used Words in Sesotho
Johannes Sibeko | Orphée De Clercq

This paper describes the SpeechReporting Corpus, an online collection of corpora annotated for a range of discourse phenomena. The corpora contain folktales from 7 lesser-studied West African languages. Apart from its value for theoretical linguistics, especially for the study of reported speech, the database is an important resource for the preservation of intangible cultural heritage of minority languages and the development and testing of cross-linguistically applicable computational tools.

pdf abs
Deep learning and low-resource languages: How much data is enough? A case study of three linguistically distinct South African languages
Roald Eiselen | Tanja Gaustad

In this paper we present a case study for three under-resourced linguistically distinct South African languages (Afrikaans, isiZulu, and Sesotho sa Leboa) to investigate the influence of data size and linguistic nature of a language on the performance of different embedding types. Our experimental setup consists of training embeddings on increasing amounts of data and then evaluating the impact of data size for the downstream task of part of speech tagging. We find that relatively little data can produce useful representations for this specific task for all three languages. Our analysis also shows that the influence of linguistic and orthographic differences between languages should not be underestimated: morphologically complex, conjunctively written languages (isiZulu in our case) need substantially more data to achieve good results, while disjunctively written languages require substantially less data. This is not only the case with regard to the data for training the embedding model, but also annotated training material for the task at hand. It is therefore imperative to know the characteristics of the language you are working on to make linguistically informed choices about the amount of data and the type of embeddings to use.

pdf abs
IsiXhosa Intellectual Traditions Digital Archive: Digitizing isiXhosa texts from 1870-1914
Jonathan Schoots | Amandla Ngwendu | Jacques De Wet | Sanjin Muftic

This article offers an overview of the IsiXhosa Intellectual Traditions Digital Archive, which hosts digitized texts and images of early isiXhosa newspapers and books from 1870-1914. The archive offers new opportunities for a range of research across multiple fields, and responds to debates around the importance of African intellectual traditions and their indigenous language sources in generating African social sciences which is contextually relevant. We outline the content and context of these materials and offer qualitative and quantitative details with the aim of providing an overview for interested scholars and a reference for those using the archive.

pdf abs
Analyzing political formation through historical isiXhosa text analysis: Using frequency analysis to examine emerging African Nationalism in South Africa
Jonathan Schoots

This paper showcases new research avenues made possible by applying computational methods to historical isiXhosa text. I outline a method for isiXhosa computational text analysis which adapts word frequency analysis to be applied to isiXhosa texts focusing on root words. The paper showcases the value of the approach in a study of emerging political identities in early African nationalism, examining a novel dataset of isiXhosa newspapers from 1874 to 1890. The analysis shows how a shared identity of ‘Blackness’ (Abantsundu and Abamnyama) dynamically emerged, and follows the impact of leading intellectuals as well as African voter mobilization in shaping communal political discourse.

pdf abs
Evaluating the Sesotho rule-based syllabification system on Sepedi and Setswana words
Johannes Sibeko | Mmasibidi Setaka

The purpose of this article is to demonstrate that the recently developed automated rule-based syllabification system for Sesotho can be used broadly across the officially recognised South African Sotho-Tswana language group encompassing Sepedi, Sesotho and Setswana. We evaluate the automatic syllabification system on 400 words comprising 100 most frequently used words and 100 least-used words in Sepedi and Setswana as evident in the Autshumato corpus publicly available online. It is found that the Sesotho rule-based syllabification system can be used to correctly identify vowel-only syllables, consonant-vowel syllables and consonant-only syllables in Sepedi and Setswana. Among other findings, it has been demonstrated that words with diacritics as in the case of Sepedi are correctly broken down into syllables. We make two main recommendations. First, the rules for syllabification should be updated so that Sepedi diacritics are accommodated. Second, the syllabification system should be updated so that it reflects the broader Sotho-Tswana language group instead of being limited to Sesotho. Further research is needed to ascertain whether the complex consonant [ny] behaves similarly in all three officially recognised Sotho-Tswana languages and evaluate the need for a specific rule for the [ny] digraph.

pdf abs
Towards a Swahili Universal Dependency Treebank: Leveraging the Annotations of the Helsinki Corpus of Swahili
Kenneth Steimel | Sandra Kübler

Dependency annotation can be a laborious process for under-resourced languages. However, in some cases, other resources are available. We investigate whether we can leverage such resources in the case of Swahili: We use the Helsinki Corpus of Swahili for creating a Universal Depedencies treebank for Swahili. The Helsinki Corpus of Swahili provides word-level annotations for part of speech tags, morphological features, and functional syntactic tags. We train neural taggers for these types of annotations, then use those models to annotate our target corpus, the Swahili portion of the OPUS Global Voices Corpus. Based on those annotations, we then manually create constraint grammar rules to annotate the target corpus for Universal Dependencies. In this paper, we describe the process, discuss the annotation decisions we had to make, and we evaluate the approach.

pdf abs
Comparing methods of orthographic conversion for Bàsàá, a language of Cameroon
Alexandra O’neil | Daniel Swanson | Robert Pugh | Francis Tyers | Emmanuel Ngue Um

Orthographical standardization is a milestone in a language’s documentation and the development of its resources. However, texts written in former orthographies remain relevant to the language’s history and development and therefore must be converted to the standardized orthography. Ensuring a language has access to the orthographically standardized version of all of its recorded texts is important in the development of resources as it provides additional textual resources for training, supports contribution of authors using former writing systems, and provides information about the development of the language. This paper evaluates the performance of natural language processing methods, specifically Finite State Transducers and Long Short-term Memory networks, for the orthographical conversion of Bàsàá texts from the Protestant missionary orthography to the now-standard AGLC orthography, with the conclusion that LSTMs are somewhat more effective in the absence of explicit lexical information.

pdf abs
Vowels and the Igala Language Resources
Mahmud Momoh

The aim of this article is to provide some insight into the use of the diacritic orthography in the writing of the Igala Language corpus. The aim was to use a lexical approach in identifying some of the words inherent in the language. Examples with sentences and interpretation were also provided with footnotes illustrations to better expiate some of the words and examples that could not be reflected upon in the main body of the work. The article as a matter of fact combines up to seven diacritic forms in order to better tackle the oft en-counter problem of pronouncing words in texts written in foreign language with supporting indicators provided in the work to guide users on how to pronounce the words using the diacritic forms that were adopted for the sake of this work. A total of 30 vowels were identified (5 short vowels and 25 long vowels of different variety) plus 8 diphthongs.

pdf abs
Investigating Sentiment-Bearing Words- and Emoji-based Distant Supervision Approaches for Sentiment Analysis
Ronny Mabokela | Mpho Roborife | Turguy Celik

Sentiment analysis focuses on the automatic detection and classification of opinions expressed in texts. Emojis can be used to determine the sentiment polarities of the texts (i.e. positive, negative, or neutral). Several studies demonstrated how sentiment analysis is accurate when emojis are used (Kaity and Balakrishnan, 2020). While they have used emojis as features to improve the performance of sentiment analysis systems, in this paper we analyse the use of emojis to reduce the manual effort inlabelling text for training those systems. Furthermore, we investigate the manual effort reduction in the sentiment labelling process with the help of sentiment-bearing words as well as the combination of sentiment-bearing words and emojis. In addition to English, we evaluated the approaches with the low-resource African languages Sepedi, Setswana, and Sesotho. The combination of emojis and words sentiment lexicon shows better performance compared to emojis-only lexicons and words-based lexicons. Our results show that our emoji sentiment lexicon approach is effective, with an accuracy of 75% more than other sentiment lexicon approaches, which have an average accuracy of 69.1%. Furthermore, our distant supervision method obtained an accuracy of 76%. We anticipate that only 24% of the tweets will need to be changed as a result of our annotation strategies

This survey delves into the current state of natural language processing (NLP) for four Ethiopian languages: Amharic, Afaan Oromo, Tigrinya, and Wolaytta. Through this paper, we identify key challenges and opportunities for NLP research in Ethiopia.Furthermore, we provide a centralized repository on GitHub that contains publicly available resources for various NLP tasks in these languages. This repository can be updated periodically with contributions from other researchers. Our objective is to disseminate information to NLP researchers interested in Ethiopian languages and encourage future research in this domain.