Reem AlYami


2024

The KIND Dataset: A Social Collaboration Approach for Nuanced Dialect Data Collection
Asma Yamani | Raghad Alziyady | Reem AlYami | Salma Albelali | Leina Albelali | Jawharah Almulhim | Amjad Alsulami | Motaz Alfarraj | Rabeah Al-Zaidy
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop

Nuanced dialects are linguistic varieties that pose several challenges for NLP models and techniques. One of the main challenges is the limited number of datasets available to enable extensive research and experimentation. We propose an approach for efficiently collecting nuanced dialectal datasets that are not only of high quality but also versatile enough to be multipurpose. To test our approach, we collect the KIND corpus, a collection of fine-grained Arabic dialect data. The data consists of short texts and, unlike many nuanced dialectal datasets, is curated manually through social collaboration efforts rather than crawled from social media. The collaborative approach is incentivized through educational gamification and competitions in which the community itself benefits from the open-source dataset. Our approach aims to: (1) cover dialects from under-represented groups and fine-grained dialectal varieties, (2) provide aligned parallel corpora between Modern Standard Arabic (MSA) and multiple dialects to enable translation and comparison studies, and (3) promote innovative approaches for nuanced dialect data collection. We describe the steps of the competition as well as the resulting datasets and the competing data collection systems. The KIND dataset is shared with the research community.
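The second aim above, aligned MSA-dialect parallel corpora, can be pictured with a small sketch. The field names, dialect labels, and JSON Lines layout below are illustrative assumptions, not the published KIND schema; they only show how sentence-aligned pairs with fine-grained dialect labels might be stored and loaded for translation or comparison experiments.

```python
# Illustrative sketch of an aligned MSA-dialect parallel corpus in JSON Lines.
# NOTE: field names ("msa", "dialect_text", "dialect_label") and the labels
# are hypothetical, not the actual KIND release format.
import json
from pathlib import Path

records = [
    {"msa": "<MSA sentence 1>", "dialect_text": "<dialect rendering 1>", "dialect_label": "Najdi"},
    {"msa": "<MSA sentence 2>", "dialect_text": "<dialect rendering 2>", "dialect_label": "Hijazi"},
]

path = Path("kind_sample.jsonl")
with path.open("w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Load the aligned pairs back, grouped by dialect label, e.g. for
# MSA -> dialect translation experiments.
pairs_by_dialect = {}
with path.open(encoding="utf-8") as f:
    for line in f:
        rec = json.loads(line)
        pairs_by_dialect.setdefault(rec["dialect_label"], []).append(
            (rec["msa"], rec["dialect_text"])
        )
```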

2022

Weakly and Semi-Supervised Learning for Arabic Text Classification using Monodialectal Language Models
Reem AlYami | Rabah Al-Zaidy
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)

The lack of resources such as annotated datasets and tools for low-resource languages is a significant obstacle to the advancement of Natural Language Processing (NLP) applications targeting users who speak these languages. Although learning techniques such as semi-supervised and weakly supervised learning are effective for text classification when annotated data is limited, they remain under-investigated in many languages due to the overall sparsity of data, both labeled and unlabeled. In this study, we deploy both weakly and semi-supervised learning approaches for text classification in low-resource languages and address the underlying limitations that can hinder the effectiveness of these techniques. To that end, we propose a suite of language-agnostic techniques for large-scale data collection, automatic data annotation, and language model training in scenarios where resources are scarce. Specifically, we propose a novel data collection pipeline for under-represented languages, or dialects, that is language and task agnostic and yields data of sufficient size for training a language model capable of achieving competitive results on common NLP tasks, as our experiments show. The models will be shared with the research community.
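To make the weakly and semi-supervised setting concrete, the sketch below shows a generic weak-labeling plus self-training loop for text classification. It is a minimal toy illustration using TF-IDF features and logistic regression under assumed keyword rules, thresholds, and placeholder data; it is not the paper's pipeline, which builds on monodialectal language models.

```python
# Minimal weak supervision + self-training sketch for low-resource text
# classification. Toy data, keyword rules, and the 0.8 confidence threshold
# are assumptions for illustration only.
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def weak_label(text):
    """Rule-based weak labeler: returns a noisy class id or None (abstain)."""
    if "match" in text or "goal" in text:
        return 0  # sports
    if "market" in text or "price" in text:
        return 1  # economy
    return None

unlabeled = [
    "the team scored a late goal in the match",
    "the stock market closed lower as the oil price fell",
    "fans celebrated the match result downtown",
    "analysts expect the price of wheat to rise",
    "a quiet day with little news",
]

# 1) Weak annotation: keep only texts the rules can label; the rest form the pool.
labeled = [(t, lbl) for t in unlabeled if (lbl := weak_label(t)) is not None]
texts, labels = zip(*labeled)
pool = [t for t in unlabeled if weak_label(t) is None]

vec = TfidfVectorizer()
X = vec.fit_transform(list(texts) + pool)
X_lab, X_pool = X[: len(texts)], X[len(texts):]
y = np.array(labels)

# 2) Self-training: iteratively add confident pseudo-labels from the pool.
clf = LogisticRegression(max_iter=1000)
for _ in range(3):
    clf.fit(X_lab, y)
    if X_pool.shape[0] == 0:
        break
    proba = clf.predict_proba(X_pool)
    confident = np.flatnonzero(proba.max(axis=1) >= 0.8)
    if confident.size == 0:
        break
    X_lab = vstack([X_lab, X_pool[confident]])
    y = np.concatenate([y, proba[confident].argmax(axis=1)])
    keep = np.setdiff1d(np.arange(X_pool.shape[0]), confident)
    X_pool = X_pool[keep]
```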