2024
pdf
abs
LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions
Minghao Wu
|
Abdul Waheed
|
Chiyu Zhang
|
Muhammad Abdul-Mageed
|
Alham Fikri Aji
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) with instruction fine-tuning demonstrate superior generative capabilities. However, these models are resource-intensive. To alleviate this issue, we explore distilling knowledge from instruction-tuned LLMs into much smaller ones. While similar work exists, it is often conducted on a limited set of (usually still large) models and is not accompanied by proper evaluation. To this end, we carefully develop a large set of 2.58M instructions based on both existing and newly generated instructions. In addition to being sizable, we design our instructions to cover a broad set of topics to ensure diversity. Extensive analysis of our instruction dataset confirms its diversity, and we generate responses for these instructions using gpt-3.5-turbo. Leveraging these instructions, we fine-tune a diverse herd of models, collectively referred to as LaMini-LM, which includes models from both the encoder-decoder and decoder-only families, with varying sizes. We evaluate the performance of our models using automatic metrics on 15 different natural language processing (NLP) benchmarks, as well as through human assessment. We also assess our models for hallucination and toxicity, and for the former, we introduce a new benchmark dataset for hallucination-inducing QA. The results demonstrate that our proposed LaMini-LM models are comparable to strong baselines while being much smaller in size.
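As a rough illustration of the distillation recipe described in the abstract (teacher responses generated offline with gpt-3.5-turbo, then a small student fine-tuned on the resulting pairs), the sketch below fine-tunes a small encoder-decoder model on instruction-response pairs with Hugging Face transformers. The checkpoint, hyperparameters, and toy data are assumptions for the sketch, not the paper's actual configuration.

```python
# Sketch of the instruction-distillation recipe: a small student model is
# fine-tuned on (instruction, teacher-response) pairs collected offline.
# Model name, hyperparameters, and data below are illustrative placeholders.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

pairs = [  # distilled data: instruction + response generated by the teacher LLM
    {"instruction": "Explain photosynthesis in one sentence.",
     "response": "Photosynthesis is the process by which plants convert light, "
                 "water, and carbon dioxide into sugars and oxygen."},
    {"instruction": "Give a synonym for 'quick'.",
     "response": "Fast."},
]
dataset = Dataset.from_list(pairs)

student = "google/flan-t5-small"  # any small encoder-decoder works for the sketch
tokenizer = AutoTokenizer.from_pretrained(student)
model = AutoModelForSeq2SeqLM.from_pretrained(student)

def preprocess(batch):
    # Tokenize instructions as inputs and teacher responses as labels.
    inputs = tokenizer(batch["instruction"], truncation=True, max_length=512)
    labels = tokenizer(text_target=batch["response"], truncation=True, max_length=512)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="lamini-student-sketch",
                                  num_train_epochs=3,
                                  per_device_train_batch_size=8,
                                  learning_rate=5e-4),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```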
2023
pdf
abs
TARJAMAT: Evaluation of Bard and ChatGPT on Machine Translation of Ten Arabic Varieties
Karima Kadaoui
|
Samar Magdy
|
Abdul Waheed
|
Md Tawkat Islam Khondaker
|
Ahmed El-Shangiti
|
El Moatez Billah Nagoudi
|
Muhammad Abdul-Mageed
Proceedings of ArabicNLP 2023
Despite the purported multilingual proficiency of instruction-finetuned large language models (LLMs) such as ChatGPT and Bard, the linguistic inclusivity of these models remains insufficiently explored. Considering this constraint, we present a thorough assessment of Bard and ChatGPT (encompassing both GPT-3.5 and GPT-4) regarding their machine translation proficiencies across ten varieties of Arabic. Our evaluation covers diverse Arabic varieties such as Classical Arabic (CA), Modern Standard Arabic (MSA), and several country-level dialectal variants. Our analysis indicates that LLMs may encounter challenges with dialects for which minimal public datasets exist, but on average are better translators of dialects than existing commercial systems. On CA and MSA, instruction-tuned LLMs, however, trail behind commercial systems such as Google Translate. Finally, we undertake a human-centric study to scrutinize the efficacy of the relatively recent model, Bard, in following human instructions during translation tasks. Our analysis reveals a circumscribed capability of Bard in aligning with human instructions in translation contexts. Collectively, our findings underscore that prevailing LLMs remain far from inclusive, with only limited ability to cater for the linguistic and cultural intricacies of diverse communities.
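A minimal sketch of how per-variety translation quality might be scored, assuming one hypothesis file and one reference file per Arabic variety; the variety list, file layout, and the use of sacrebleu BLEU here are illustrative assumptions, not the paper's exact evaluation pipeline.

```python
# Illustrative per-variety scoring loop: compute corpus-level BLEU for each
# Arabic variety from plain-text hypothesis/reference files (hypothetical names).
import sacrebleu

varieties = ["CA", "MSA", "Egyptian", "Moroccan"]  # hypothetical subset of the ten varieties

for variety in varieties:
    with open(f"hyp.{variety}.txt", encoding="utf-8") as f:
        hypotheses = [line.strip() for line in f]
    with open(f"ref.{variety}.txt", encoding="utf-8") as f:
        references = [line.strip() for line in f]
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    print(f"{variety}: BLEU = {bleu.score:.1f}")
```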
pdf
abs
VoxArabica: A Robust Dialect-Aware Arabic Speech Recognition System
Abdul Waheed
|
Bashar Talafha
|
Peter Sullivan
|
AbdelRahim Elmadany
|
Muhammad Abdul-Mageed
Proceedings of ArabicNLP 2023
Arabic is a complex language with many varieties and dialects spoken by approximately 450 million people around the world. Due to this linguistic diversity and variation, it is challenging to build a robust and generalized ASR system for Arabic. In this work, we address this gap by developing and demoing a system, dubbed VoxArabica, for dialect identification (DID) as well as automatic speech recognition (ASR) of Arabic. We train a wide range of models such as HuBERT (DID), Whisper, and XLS-R (ASR) in a supervised setting for Arabic DID and ASR tasks. Our DID models are trained to identify 17 different dialects in addition to MSA. We finetune our ASR models on MSA, Egyptian, Moroccan, and mixed data. Additionally, for the remaining dialects in ASR, we provide the option to choose various models such as Whisper and MMS in a zero-shot setting. We integrate these models into a single web interface with diverse features such as audio recording, file upload, model selection, and the option to raise flags for incorrect outputs. Overall, we believe VoxArabica will be useful for a wide range of audiences concerned with Arabic research. Our system is currently running at https://cdce-206-12-100-168.ngrok.io/.
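A minimal sketch of the zero-shot ASR option mentioned in the abstract, transcribing an Arabic clip with an off-the-shelf Whisper checkpoint through the Hugging Face pipeline API; the checkpoint name, language setting, and audio file are placeholders rather than VoxArabica's deployed configuration.

```python
# Zero-shot Arabic transcription with an off-the-shelf Whisper checkpoint
# (illustrative choice; the audio file name is a placeholder).
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v2")

result = asr("sample_arabic_clip.wav",
             generate_kwargs={"language": "arabic", "task": "transcribe"})
print(result["text"])
```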
pdf
abs
GPTAraEval: A Comprehensive Evaluation of ChatGPT on Arabic NLP
Md Tawkat Islam Khondaker
|
Abdul Waheed
|
El Moatez Billah Nagoudi
|
Muhammad Abdul-Mageed
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
ChatGPT’s emergence heralds a transformative phase in NLP, particularly demonstrated through its excellent performance on many English benchmarks. However, the model’s efficacy across diverse linguistic contexts remains largely uncharted territory. This work aims to bridge this knowledge gap, with a primary focus on assessing ChatGPT’s capabilities on the Arabic language and its dialectal varieties. Our comprehensive study conducts a large-scale automated and human evaluation of ChatGPT, encompassing 44 distinct language understanding and generation tasks on over 60 different datasets. To our knowledge, this marks the first extensive performance analysis of ChatGPT’s deployment in Arabic NLP. Our findings indicate that, despite its remarkable performance in English, ChatGPT is consistently surpassed by smaller models that have undergone finetuning on Arabic. We further undertake a meticulous comparison of ChatGPT and GPT-4 on Modern Standard Arabic (MSA) and Dialectal Arabic (DA), unveiling the relative shortcomings of both models in handling Arabic dialects compared to MSA. While we also explore and confirm the utility of employing GPT-4 as a potential alternative to human evaluation, our work adds to a growing body of research underscoring the limitations of ChatGPT.
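A sketch of the kind of zero-shot prompting loop such large-scale evaluations rely on: query a chat model on a classification task and score its predictions against gold labels. The task, prompt wording, and toy examples are placeholders, not the benchmark tasks or prompts used in the paper.

```python
# Illustrative zero-shot evaluation loop: prompt a chat model for Arabic
# sentiment labels and compute accuracy against toy gold labels.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

examples = [  # (Arabic text, gold sentiment label) -- toy placeholders
    ("الفيلم كان رائعاً", "positive"),
    ("الخدمة سيئة جداً", "negative"),
]

correct = 0
for text, gold in examples:
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": "Classify the sentiment of the following Arabic text "
                              f"as positive or negative.\nText: {text}\nLabel:"}],
        temperature=0,
    )
    prediction = reply.choices[0].message.content.strip().lower()
    correct += int(gold in prediction)

print(f"accuracy = {correct / len(examples):.2f}")
```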
2021
pdf
BloomNet: A Robust Transformer based model for Bloom’s Learning Outcome Classification
Abdul Waheed
|
Muskan Goyal
|
Nimisha Mittal
|
Deepak Gupta
|
Ashish Khanna
|
Moolchand Sharma
Proceedings of the 4th International Conference on Natural Language and Speech Processing (ICNLSP 2021)
pdf
abs
Analyzing the Domain Robustness of Pretrained Language Models, Layer by Layer
Abhinav Ramesh Kashyap
|
Laiba Mehnaz
|
Bhavitvya Malik
|
Abdul Waheed
|
Devamanyu Hazarika
|
Min-Yen Kan
|
Rajiv Ratn Shah
Proceedings of the Second Workshop on Domain Adaptation for NLP
The robustness of pretrained language models (PLMs) is generally measured using performance drops on two or more domains. However, we do not yet understand the inherent robustness achieved by contributions from different layers of a PLM. We systematically analyze the robustness of these representations layer by layer from two perspectives. First, we measure the robustness of representations by using domain divergence between two domains. We find that i) domain variance increases from the lower to the upper layers for vanilla PLMs; ii) models continuously pretrained on domain-specific data (DAPT; Gururangan et al., 2020) exhibit more variance than their pretrained PLM counterparts; and iii) distilled models (e.g., DistilBERT) also show greater domain variance. Second, we investigate the robustness of representations by analyzing the encoded syntactic and semantic information using diagnostic probes. We find that similar layers have similar amounts of linguistic information for data from an unseen domain.
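A simplified sketch of the layer-wise analysis idea, assuming bert-base-uncased as the PLM: mean-pool each layer's hidden states for sentences from two domains and compare the resulting domain centroids per layer. The cosine-distance proxy and toy sentences are illustrative assumptions; the paper uses proper domain-divergence measures and diagnostic probes.

```python
# Extract per-layer representations for two domains and compare their centroids.
# The cosine-distance comparison is only an illustrative proxy for divergence.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def layer_means(sentences):
    """Return one mean-pooled vector per layer (embeddings + 12 layers) for the sentences."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden_states = model(**batch).hidden_states
    mask = batch["attention_mask"].unsqueeze(-1)  # ignore padding tokens when pooling
    return [(h * mask).sum(dim=(0, 1)) / mask.sum() for h in hidden_states]

domain_a = ["The patient was administered 5 mg of the drug.",
            "Symptoms resolved after two weeks of treatment."]
domain_b = ["The movie was a delightful surprise.",
            "I could not put the novel down."]

for layer, (a, b) in enumerate(zip(layer_means(domain_a), layer_means(domain_b))):
    distance = 1 - torch.nn.functional.cosine_similarity(a, b, dim=0).item()
    print(f"layer {layer:2d}: centroid cosine distance = {distance:.3f}")
```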