2025
Dotless Arabic Text for Natural Language Processing
Maged S. Al-Shaibani | Irfan Ahmad
Computational Linguistics, Volume 51, Issue 2 - June 2025
This article introduces a novel representation of Arabic text as an alternative approach for Arabic NLP, inspired by the dotless script of ancient Arabic. We explored this representation through extensive analysis on various text corpora, differing in size and domain, and tokenized using multiple tokenization techniques. Furthermore, we examined the information density of this representation and compared it with the standard dotted Arabic text using text entropy analysis. Utilizing parallel corpora, we also drew comparisons between Arabic and English text analysis to gain additional insights. Our investigation extended to various upstream and downstream NLP tasks, including language modeling, text classification, sequence labeling, and machine translation, examining the implications of both representations. Specifically, we performed seven different downstream tasks using various tokenization schemes, comparing the standard dotted text with the dotless Arabic text representation. Performance with both representations was comparable across the different tokenizations. However, the dotless representation achieves these results with significantly smaller vocabularies, with reductions of up to 50% in some scenarios. Additionally, we present a system that restores dots to dotless Arabic text, which is useful for tasks that require standard Arabic text as output.
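A minimal sketch of the idea in Python, assuming a simplified letter-to-skeleton mapping (an approximation of the historical rasm, not necessarily the paper's exact scheme):

```python
# Illustrative sketch: map each dotted Arabic letter to an undotted
# skeleton character. This mapping is a simplified approximation of the
# historical rasm, not necessarily the paper's exact scheme.
DOTLESS_MAP = {
    "ب": "ٮ", "ت": "ٮ", "ث": "ٮ", "ن": "ٮ", "ي": "ى",  # beh-family skeleton
    "ج": "ح", "خ": "ح",                                  # hah-family skeleton
    "ذ": "د", "ز": "ر", "ش": "س", "ض": "ص",
    "ظ": "ط", "غ": "ع", "ف": "ٯ", "ق": "ٯ",
}

def to_dotless(text: str) -> str:
    """Replace every dotted letter with its dotless skeleton; keep the rest."""
    return "".join(DOTLESS_MAP.get(ch, ch) for ch in text)

# Several letters collapse onto one skeleton, which is why the dotless
# vocabulary shrinks -- and why restoring the dots is a non-trivial task.
print(to_dotless("تجربة جديدة"))
```

Because distinct dotted words can collapse onto the same skeleton, the reverse mapping has to be resolved from context, which is what the dot-restoration system described above addresses.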
2024
CIDAR: Culturally Relevant Instruction Dataset For Arabic
Zaid Alyafeai | Khalid Almubarak | Ahmed Ashraf | Deema Alnuhait | Saied Alshahrani | Gubran A. Q. Abdulrahman | Gamil Ahmed | Qais Gawah | Zead Saleh | Mustafa Ghaleb | Yousef Ali | Maged S. Al-shaibani
Findings of the Association for Computational Linguistics: ACL 2024
Instruction tuning has emerged as a prominent methodology for teaching Large Language Models (LLMs) to follow instructions. However, current instruction datasets predominantly cater to English or are derived from English-dominated LLMs, leading to inherent biases toward Western culture. This bias negatively impacts non-English languages, such as Arabic, and the unique culture of the Arab region. This paper addresses this limitation by introducing CIDAR, the first open Arabic instruction-tuning dataset culturally aligned by native Arabic speakers. CIDAR contains 10,000 instruction-output pairs that represent the Arab region. We discuss the cultural relevance of CIDAR through analysis and comparison with models fine-tuned on other datasets. Our experiments indicate that models fine-tuned on CIDAR achieve better cultural alignment than those fine-tuned on 30x more data.
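A hypothetical sketch of what an instruction-output pair and its training-string form might look like; the field names, formatting, and Arabic content are illustrative only, not drawn from CIDAR itself:

```python
# Hypothetical instruction-output pair; field names and content are
# illustrative only, not taken from the CIDAR dataset itself.
example = {
    # "Mention three popular dishes in Arab cuisine."
    "instruction": "اذكر ثلاثة من الأطباق الشعبية في المطبخ العربي.",
    # "Among the popular dishes: kabsa, mansaf, and couscous."
    "output": "من الأطباق الشعبية: الكبسة والمنسف والكسكس.",
}

# One common way to fold such a pair into a single training string:
prompt = (
    f"### Instruction:\n{example['instruction']}\n\n"
    f"### Response:\n{example['output']}"
)
print(prompt)
```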
2023
Consonant is all you need: a compact representation of English text for efficient NLP
Maged S. Al-shaibani | Irfan Ahmad
Findings of the Association for Computational Linguistics: EMNLP 2023
In natural language processing (NLP), the representation of text plays a crucial role in various tasks such as language modeling, sentiment analysis, and machine translation. The standard approach is to represent text in the same way as we, as humans, read and write. In this paper, we propose a novel approach: representing text with only consonants, a compact representation of English text that offers improved efficiency without sacrificing performance. We exploit the fact that consonants are more discriminative than vowels; by representing text using consonants, we can significantly reduce the memory and compute footprint required for storing and processing textual data. We present two alternative representations: ‘consonants-only’, where we completely remove the vowels from the text, and ‘masked-vowels’, where we mask all the vowels with one special symbol. To evaluate our approaches, we conducted experiments on various NLP tasks, including text classification, part-of-speech (POS) tagging, named-entity recognition (NER), and neural machine translation (NMT), in addition to language modeling. Our results demonstrate that the proposed consonant-based representation achieves performance comparable to the standard text representation while requiring significantly fewer computational resources. Furthermore, we show that our representation can be seamlessly integrated with existing NLP models and frameworks, providing a practical solution for efficient text processing. Last but not least, we present a technique to retrieve the vowel information from our processed text representation, keeping in mind the need to reproduce text in human-readable form in some NLP applications.
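A minimal Python sketch of the two representations, assuming the five standard English vowels and ‘*’ as the mask symbol (the paper's exact choices, e.g. how ‘y’ is handled, may differ):

```python
# Minimal sketch of the two representations described above.
VOWELS = set("aeiouAEIOU")

def consonants_only(text: str) -> str:
    """'consonants-only': remove the vowels entirely."""
    return "".join(ch for ch in text if ch not in VOWELS)

def masked_vowels(text: str, mask: str = "*") -> str:
    """'masked-vowels': replace every vowel with one special symbol."""
    return "".join(mask if ch in VOWELS else ch for ch in text)

print(consonants_only("natural language processing"))  # "ntrl lngg prcssng"
print(masked_vowels("natural language processing"))    # "n*t*r*l l*ng**g* pr*c*ss*ng"
```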
2022
PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts
Stephen H. Bach | Victor Sanh | Zheng-Xin Yong | Albert Webson | Colin Raffel | Nihal V. Nayak | Abheesht Sharma | Taewoon Kim | M Saiful Bari | Thibault Fevry | Zaid Alyafeai | Manan Dey | Andrea Santilli | Zhiqing Sun | Srulik Ben-David | Canwen Xu | Gunjan Chhablani | Han Wang | Jason Alan Fries | Maged S. Al-shaibani | Shanya Sharma | Urmish Thakker | Khalid Almubarak | Xiangru Tang | Dragomir Radev | Mike Tian-Jian Jiang | Alexander M. Rush
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations
PromptSource is a system for creating, sharing, and using natural language prompts. Prompts are functions that map an example from a dataset to a natural language input and target output. Using prompts to train and query language models is an emerging area in NLP that requires new tools that let users develop and refine these prompts collaboratively. PromptSource addresses the emergent challenges in this new setting with (1) a templating language for defining data-linked prompts, (2) an interface that lets users quickly iterate on prompt development by observing outputs of their prompts on many examples, and (3) a community-driven set of guidelines for contributing new prompts to a common pool. Over 2,000 prompts for roughly 170 datasets are already available in PromptSource. PromptSource is available at https://github.com/bigscience-workshop/promptsource.
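A small sketch of how a PromptSource-style prompt maps a dataset example to an input and a target: templates are Jinja-based, with “|||” separating the input from the target. The template text itself is illustrative, not taken from the repository:

```python
# Sketch of a PromptSource-style prompt: a Jinja template that maps a
# dataset example to an input and a target, separated by "|||".
# The template text below is illustrative, not from the repository.
from jinja2 import Template

TEMPLATE = (
    "Review: {{ text }}\n"
    "Is this review positive or negative? ||| "
    '{{ ["negative", "positive"][label] }}'
)

example = {"text": "A delightful film from start to finish.", "label": 1}

rendered = Template(TEMPLATE).render(**example)
input_text, target = (part.strip() for part in rendered.split("|||"))
print(input_text)  # the natural language input fed to the model
print(target)      # "positive"
```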
Masader: Metadata Sourcing for Arabic Text and Speech Data Resources
Zaid Alyafeai | Maraim Masoud | Mustafa Ghaleb | Maged S. Al-shaibani
Proceedings of the Thirteenth Language Resources and Evaluation Conference
The NLP pipeline has evolved dramatically in the last few years. The first step in the pipeline is to find suitable annotated datasets to evaluate the tasks we are trying to solve. Unfortunately, most published datasets lack metadata annotations that describe their attributes. Moreover, there is no public catalogue that indexes all the publicly available datasets related to specific regions or languages. This issue becomes even more prominent when we consider low-resource dialectal languages, for example. In this paper, we create Masader, the largest public catalogue for Arabic NLP datasets, which consists of 200 datasets annotated with 25 attributes. Furthermore, we develop a metadata annotation strategy that could be extended to other languages. We also make remarks on the current status of Arabic NLP datasets, highlight some issues, and suggest recommendations to address them.
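A hypothetical sketch of the kind of per-dataset metadata entry such a catalogue records; the attribute names below are assumptions for illustration, not Masader's actual 25-attribute schema:

```python
# Hypothetical per-dataset metadata entry; the attribute names are
# assumptions for illustration, not Masader's actual schema.
entry = {
    "name": "Example Arabic Corpus",
    "language": "ar",
    "dialect": "Modern Standard Arabic",
    "tasks": ["text classification"],
    "size": "1M sentences",
    "license": "CC BY 4.0",
    "access": "public",
}

# Consistent attributes make simple catalogue-wide queries possible:
def is_public_msa(e: dict) -> bool:
    return e["access"] == "public" and e["dialect"] == "Modern Standard Arabic"

print(is_public_msa(entry))  # True
```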
2020
ARBML: Democratizing Arabic Natural Language Processing Tools
Zaid Alyafeai | Maged S. Al-Shaibani
Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)
Automating natural language understanding is a long-standing quest that has been addressed for decades. With the help of advances in machine learning, and particularly deep learning, we are able to produce state-of-the-art models that can imitate human interactions with languages. Unfortunately, these advances are constrained by the availability of language resources. Advances in this field for Arabic, despite its great potential, are still limited in both research and development. In this paper, we showcase some NLP models we trained for Arabic. We also present our methodology and pipeline for building such models, from data collection through data preprocessing and tokenization to model deployment. These tools help advance the field and provide a systematic approach for extending NLP tools to many languages.
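An illustrative preprocessing step of the kind such a pipeline includes, assuming standard Arabic normalization choices (a generic sketch, not ARBML's actual implementation):

```python
# Illustrative Arabic preprocessing: strip diacritics and normalize
# common letter variants. A generic sketch, not ARBML's actual code.
import re

DIACRITICS = re.compile(r"[\u064B-\u0652]")  # fathatan through sukun

def normalize_arabic(text: str) -> str:
    text = DIACRITICS.sub("", text)      # drop short-vowel marks
    text = re.sub("[إأآ]", "ا", text)    # unify alef variants
    return text

print(normalize_arabic("اللُّغَةُ العَرَبِيَّة"))  # اللغة العربية
```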