Shamsuddeen Muhammad


2024

pdf
SemRel2024: A Collection of Semantic Textual Relatedness Datasets for 13 Languages
Nedjma Ousidhoum | Shamsuddeen Muhammad | Mohamed Abdalla | Idris Abdulmumin | Ibrahim Ahmad | Sanchit Ahuja | Alham Aji | Vladimir Araujo | Abinew Ayele | Pavan Baswani | Meriem Beloucif | Chris Biemann | Sofia Bourhim | Christine Kock | Genet Dekebo | Oumaima Hourrane | Gopichand Kanumolu | Lokesh Madasu | Samuel Rutunda | Manish Shrivastava | Thamar Solorio | Nirmal Surange | Hailegnaw Tilaye | Krishnapriya Vishnubhotla | Genta Winata | Seid Yimam | Saif Mohammad
Findings of the Association for Computational Linguistics ACL 2024

Exploring and quantifying semantic relatedness is central to representing language and holds significant implications across various NLP tasks. While earlier NLP research primarily focused on semantic similarity, often within the English language context, we instead investigate the broader phenomenon of semantic relatedness. In this paper, we present SemRel, a new semantic relatedness dataset collection annotated by native speakers across 13 languages: Afrikaans, Algerian Arabic, Amharic, English, Hausa, Hindi, Indonesian, Kinyarwanda, Marathi, Moroccan Arabic, Modern Standard Arabic, Spanish, and Telugu. These languages originate from five distinct language families and are predominantly spoken in Africa and Asia – regions characterised by a relatively limited availability of NLP resources. Each instance in the SemRel datasets is a sentence pair associated with a score that represents the degree of semantic textual relatedness between the two sentences. The scores are obtained using a comparative annotation framework. We describe the data collection and annotation processes, challenges when building the datasets, baseline experiments, and their impact and utility in NLP.

pdf
AfriMTE and AfriCOMET: Enhancing COMET to Embrace Under-resourced African Languages
Jiayi Wang | David Adelani | Sweta Agrawal | Marek Masiak | Ricardo Rei | Eleftheria Briakou | Marine Carpuat | Xuanli He | Sofia Bourhim | Andiswa Bukula | Muhidin Mohamed | Temitayo Olatoye | Tosin Adewumi | Hamam Mokayed | Christine Mwase | Wangui Kimotho | Foutse Yuehgoh | Anuoluwapo Aremu | Jessica Ojo | Shamsuddeen Muhammad | Salomey Osei | Abdul-Hakeem Omotayo | Chiamaka Chukwuneke | Perez Ogayo | Oumaima Hourrane | Salma El Anigri | Lolwethu Ndolela | Thabiso Mangwana | Shafie Mohamed | Hassan Ayinde | Oluwabusayo Awoyomi | Lama Alkhaled | Sana Al-azzawi | Naome Etori | Millicent Ochieng | Clemencia Siro | Njoroge Kiragu | Eric Muchiri | Wangari Kimotho | Toadoum Sari Sakayo | Lyse Naomi Wamba | Daud Abolade | Simbiat Ajao | Iyanuoluwa Shode | Ricky Macharm | Ruqayya Iro | Saheed Abdullahi | Stephen Moore | Bernard Opoku | Zainab Akinjobi | Abeeb Afolabi | Nnaemeka Obiefuna | Onyekachi Ogbu | Sam Ochieng’ | Verrah Otiende | Chinedu Mbonu | Yao Lu | Pontus Stenetorp
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Despite the recent progress on scaling multilingual machine translation (MT) to several under-resourced African languages, accurately measuring this progress remains challenging, since evaluation is often performed on n-gram matching metrics such as BLEU, which typically show a weaker correlation with human judgments. Learned metrics such as COMET have higher correlation; however, the lack of evaluation data with human ratings for under-resourced languages, complexity of annotation guidelines like Multidimensional Quality Metrics (MQM), and limited language coverage of multilingual encoders have hampered their applicability to African languages. In this paper, we address these challenges by creating high-quality human evaluation data with simplified MQM guidelines for error detection and direct assessment (DA) scoring for 13 typologically diverse African languages. Furthermore, we develop AfriCOMET: COMET evaluation metrics for African languages by leveraging DA data from well-resourced languages and an African-centric multilingual encoder (AfroXLM-R) to create the state-of-the-art MT evaluation metrics for African languages with respect to Spearman-rank correlation with human judgments (0.441).

2023

pdf
Cross-lingual Open-Retrieval Question Answering for African Languages
Odunayo Ogundepo | Tajuddeen Gwadabe | Clara Rivera | Jonathan Clark | Sebastian Ruder | David Adelani | Bonaventure Dossou | Abdou Diop | Claytone Sikasote | Gilles Hacheme | Happy Buzaaba | Ignatius Ezeani | Rooweither Mabuya | Salomey Osei | Chris Emezue | Albert Kahira | Shamsuddeen Muhammad | Akintunde Oladipo | Abraham Owodunni | Atnafu Tonja | Iyanuoluwa Shode | Akari Asai | Anuoluwapo Aremu | Ayodele Awokoya | Bernard Opoku | Chiamaka Chukwuneke | Christine Mwase | Clemencia Siro | Stephen Arthur | Tunde Ajayi | Verrah Otiende | Andre Rubungo | Boyd Sinkala | Daniel Ajisafe | Emeka Onwuegbuzia | Falalu Lawan | Ibrahim Ahmad | Jesujoba Alabi | Chinedu Mbonu | Mofetoluwa Adeyemi | Mofya Phiri | Orevaoghene Ahia | Ruqayya Iro | Sonia Adhiambo
Findings of the Association for Computational Linguistics: EMNLP 2023

African languages have far less in-language content available digitally, making it challenging for question answering systems to satisfy the information needs of users. Cross-lingual open-retrieval question answering (XOR QA) systems – those that retrieve answer content from other languages while serving people in their native language—offer a means of filling this gap. To this end, we create Our Dataset, the first cross-lingual QA dataset with a focus on African languages. Our Dataset includes 12,000+ XOR QA examples across 10 African languages. While previous datasets have focused primarily on languages where cross-lingual QA augments coverage from the target language, Our Dataset focuses on languages where cross-lingual answer content is the only high-coverage source of answer content. Because of this, we argue that African languages are one of the most important and realistic use cases for XOR QA. Our experiments demonstrate the poor performance of automatic translation and multilingual retrieval methods. Overall, Our Dataset proves challenging for state-of-the-art QA models. We hope that the dataset enables the development of more equitable QA technology.

pdf
AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages
Shamsuddeen Muhammad | Idris Abdulmumin | Abinew Ayele | Nedjma Ousidhoum | David Adelani | Seid Yimam | Ibrahim Ahmad | Meriem Beloucif | Saif Mohammad | Sebastian Ruder | Oumaima Hourrane | Alipio Jorge | Pavel Brazdil | Felermino Ali | Davis David | Salomey Osei | Bello Shehu-Bello | Falalu Lawan | Tajuddeen Gwadabe | Samuel Rutunda | Tadesse Destaw Belay | Wendimu Messelle | Hailu Balcha | Sisay Chala | Hagos Gebremichael | Bernard Opoku | Stephen Arthur
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Africa is home to over 2,000 languages from over six language families and has the highest linguistic diversity among all continents. This includes 75 languages with at least one million speakers each. Yet, there is little NLP research conducted on African languages. Crucial in enabling such research is the availability of high-quality annotated datasets. In this paper, we introduce AfriSenti, a sentiment analysis benchmark that contains a total of >110,000 tweets in 14 African languages (Amharic, Algerian Arabic, Hausa, Igbo, Kinyarwanda, Moroccan Arabic, Mozambican Portuguese, Nigerian Pidgin, Oromo, Swahili, Tigrinya, Twi, Xitsonga, and Yoruba) from four language families. The tweets were annotated by native speakers and used in the AfriSenti-SemEval shared task (with over 200 participants, see website: https://afrisenti-semeval.github.io). We describe the data collection methodology, annotation process, and the challenges we dealt with when curating each dataset. We further report baseline experiments conducted on the AfriSenti datasets and discuss their usefulness.

2022

pdf
A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models for African News Translation
David Adelani | Jesujoba Alabi | Angela Fan | Julia Kreutzer | Xiaoyu Shen | Machel Reid | Dana Ruiter | Dietrich Klakow | Peter Nabende | Ernie Chang | Tajuddeen Gwadabe | Freshia Sackey | Bonaventure F. P. Dossou | Chris Emezue | Colin Leong | Michael Beukman | Shamsuddeen Muhammad | Guyo Jarso | Oreen Yousuf | Andre Niyongabo Rubungo | Gilles Hacheme | Eric Peter Wairagala | Muhammad Umair Nasir | Benjamin Ajibade | Tunde Ajayi | Yvonne Gitau | Jade Abbott | Mohamed Ahmed | Millicent Ochieng | Anuoluwapo Aremu | Perez Ogayo | Jonathan Mukiibi | Fatoumata Ouoba Kabore | Godson Kalipe | Derguene Mbaye | Allahsera Auguste Tapo | Victoire Memdjokam Koagne | Edwin Munkoh-Buabeng | Valencia Wagner | Idris Abdulmumin | Ayodele Awokoya | Happy Buzaaba | Blessing Sibanda | Andiswa Bukula | Sam Manthalu
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Recent advances in the pre-training for language models leverage large-scale datasets to create multilingual models. However, low-resource languages are mostly left out in these datasets. This is primarily because many widely spoken languages that are not well represented on the web and therefore excluded from the large-scale crawls for datasets. Furthermore, downstream users of these models are restricted to the selection of languages originally chosen for pre-training. This work investigates how to optimally leverage existing pre-trained models to create low-resource translation systems for 16 African languages. We focus on two questions: 1) How can pre-trained models be used for languages not included in the initial pretraining? and 2) How can the resulting translation models effectively transfer to new domains? To answer these questions, we create a novel African news corpus covering 16 languages, of which eight languages are not part of any existing evaluation dataset. We demonstrate that the most effective strategy for transferring both additional languages and additional domains is to leverage small quantities of high-quality translation data to fine-tune large pre-trained models.

pdf
MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition
David Adelani | Graham Neubig | Sebastian Ruder | Shruti Rijhwani | Michael Beukman | Chester Palen-Michel | Constantine Lignos | Jesujoba Alabi | Shamsuddeen Muhammad | Peter Nabende | Cheikh M. Bamba Dione | Andiswa Bukula | Rooweither Mabuya | Bonaventure F. P. Dossou | Blessing Sibanda | Happy Buzaaba | Jonathan Mukiibi | Godson Kalipe | Derguene Mbaye | Amelia Taylor | Fatoumata Kabore | Chris Chinenye Emezue | Anuoluwapo Aremu | Perez Ogayo | Catherine Gitau | Edwin Munkoh-Buabeng | Victoire Memdjokam Koagne | Allahsera Auguste Tapo | Tebogo Macucwa | Vukosi Marivate | Mboning Tchiaze Elvis | Tajuddeen Gwadabe | Tosin Adewumi | Orevaoghene Ahia | Joyce Nakatumba-Nabende
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

African languages are spoken by over a billion people, but they are under-represented in NLP research and development. Multiple challenges exist, including the limited availability of annotated training and evaluation datasets as well as the lack of understanding of which settings, languages, and recently proposed methods like cross-lingual transfer will be effective. In this paper, we aim to move towards solutions for these challenges, focusing on the task of named entity recognition (NER). We present the creation of the largest to-date human-annotated NER dataset for 20 African languages. We study the behaviour of state-of-the-art cross-lingual transfer methods in an Africa-centric setting, empirically demonstrating that the choice of source transfer language significantly affects performance. While much previous work defaults to using English as the source language, our results show that choosing the best transfer language improves zero-shot F1 scores by an average of 14% over 20 languages as compared to using English.

pdf
Hausa Visual Genome: A Dataset for Multi-Modal English to Hausa Machine Translation
Idris Abdulmumin | Satya Ranjan Dash | Musa Abdullahi Dawud | Shantipriya Parida | Shamsuddeen Muhammad | Ibrahim Sa’id Ahmad | Subhadarshi Panda | Ondřej Bojar | Bashir Shehu Galadanci | Bello Shehu Bello
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Multi-modal Machine Translation (MMT) enables the use of visual information to enhance the quality of translations, especially where the full context is not available to enable the unambiguous translation in standard machine translation. Despite the increasing popularity of such technique, it lacks sufficient and qualitative datasets to maximize the full extent of its potential. Hausa, a Chadic language, is a member of the Afro-Asiatic language family. It is estimated that about 100 to 150 million people speak the language, with more than 80 million indigenous speakers. This is more than any of the other Chadic languages. Despite the large number of speakers, the Hausa language is considered as a low resource language in natural language processing (NLP). This is due to the absence of enough resources to implement most of the tasks in NLP. While some datasets exist, they are either scarce, machine-generated or in the religious domain. Therefore, there is the need to create training and evaluation data for implementing machine learning tasks and bridging the research gap in the language. This work presents the Hausa Visual Genome (HaVG), a dataset that contains the description of an image or a section within the image in Hausa and its equivalent in English. The dataset was prepared by automatically translating the English description of the images in the Hindi Visual Genome (HVG). The synthetic Hausa data was then carefully postedited, taking into cognizance the respective images. The data is made of 32,923 images and their descriptions that are divided into training, development, test, and challenge test set. The Hausa Visual Genome is the first dataset of its kind and can be used for Hausa-English machine translation, multi-modal research, image description, among various other natural language processing and generation tasks.

pdf
Separating Grains from the Chaff: Using Data Filtering to Improve Multilingual Translation for Low-Resourced African Languages
Idris Abdulmumin | Michael Beukman | Jesujoba Alabi | Chris Chinenye Emezue | Everlyn Chimoto | Tosin Adewumi | Shamsuddeen Muhammad | Mofetoluwa Adeyemi | Oreen Yousuf | Sahib Singh | Tajuddeen Gwadabe
Proceedings of the Seventh Conference on Machine Translation (WMT)

We participated in the WMT 2022 Large-Scale Machine Translation Evaluation for the African Languages Shared Task. This work describes our approach, which is based on filtering the given noisy data using a sentence-pair classifier that was built by fine-tuning a pre-trained language model. To train the classifier, we obtain positive samples (i.e. high-quality parallel sentences) from a gold-standard curated dataset and extract negative samples (i.e. low-quality parallel sentences) from automatically aligned parallel data by choosing sentences with low alignment scores. Our final machine translation model was then trained on filtered data, instead of the entire noisy dataset. We empirically validate our approach by evaluating on two common datasets and show that data filtering generally improves overall translation quality, in some cases even significantly.

2020

pdf
Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages
Wilhelmina Nekoto | Vukosi Marivate | Tshinondiwa Matsila | Timi Fasubaa | Taiwo Fagbohungbe | Solomon Oluwole Akinola | Shamsuddeen Muhammad | Salomon Kabongo Kabenamualu | Salomey Osei | Freshia Sackey | Rubungo Andre Niyongabo | Ricky Macharm | Perez Ogayo | Orevaoghene Ahia | Musie Meressa Berhe | Mofetoluwa Adeyemi | Masabata Mokgesi-Selinga | Lawrence Okegbemi | Laura Martinus | Kolawole Tajudeen | Kevin Degila | Kelechi Ogueji | Kathleen Siminyu | Julia Kreutzer | Jason Webster | Jamiil Toure Ali | Jade Abbott | Iroro Orife | Ignatius Ezeani | Idris Abdulkadir Dangana | Herman Kamper | Hady Elsahar | Goodness Duru | Ghollah Kioko | Murhabazi Espoir | Elan van Biljon | Daniel Whitenack | Christopher Onyefuluchi | Chris Chinenye Emezue | Bonaventure F. P. Dossou | Blessing Sibanda | Blessing Bassey | Ayodele Olabiyi | Arshath Ramkilowan | Alp Öktem | Adewale Akinfaderin | Abdallah Bashir
Findings of the Association for Computational Linguistics: EMNLP 2020

Research in NLP lacks geographic diversity, and the question of how NLP can be scaled to low-resourced languages has not yet been adequately solved. ‘Low-resourced’-ness is a complex problem going beyond data availability and reflects systemic problems in society. In this paper, we focus on the task of Machine Translation (MT), that plays a crucial role for information accessibility and communication worldwide. Despite immense improvements in MT over the past decade, MT is centered around a few high-resourced languages. As MT researchers cannot solve the problem of low-resourcedness alone, we propose participatory research as a means to involve all necessary agents required in the MT development process. We demonstrate the feasibility and scalability of participatory research with a case study on MT for African languages. Its implementation leads to a collection of novel translation datasets, MT benchmarks for over 30 languages, with human evaluations for a third of them, and enables participants without formal training to make a unique scientific contribution. Benchmarks, models, data, code, and evaluation results are released at https://github.com/masakhane-io/masakhane-mt.
Search
Co-authors