Pre-trained Models or Feature Engineering: The Case of Dialectal Arabic

Kathrein Abu Kwaik, Stergios Chatzikyriakidis, Simon Dobnik


Abstract
The usage of social media platforms has resulted in the proliferation of work on Arabic Natural Language Processing (ANLP), including the development of resources. There is also an increased interest in processing Arabic dialects and a number of models and algorithms have been utilised for the purpose of Dialectal Arabic Natural Language Processing (DANLP). In this paper, we conduct a comparison study between some of the most well-known and most commonly used methods in NLP in order to test their performance on different corpora and two NLP tasks: Dialect Identification and Sentiment Analysis. In particular, we compare three general classes of models: a) traditional Machine Learning models with features, b) classic Deep Learning architectures (LSTMs) with pre-trained word embeddings, and c) different Bidirectional Encoder Representations from Transformers (BERT) models, such as Multilingual-BERT, Ara-BERT, and Twitter-Arabic-BERT. The results of the comparison show that feature-based classification can still compete with BERT models in these dialectal Arabic contexts. Transformer models can outperform traditional Machine Learning approaches, depending on the type of text they have been trained on, in contrast to classic Deep Learning models like LSTMs, which do not perform well on these tasks.
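
To make the first class of models concrete, below is a minimal sketch of a feature-based dialect-identification baseline of the kind the paper compares against BERT. The character n-gram range, the LinearSVC classifier, and the toy Egyptian/Levantine examples are illustrative assumptions, not the authors' exact setup.

# A minimal feature-based dialect-identification sketch (assumptions noted above).
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy labelled sentences (hypothetical): Egyptian (EGY) vs. Levantine (LEV) Arabic.
texts = ["ازيك عامل ايه", "كيفك شو الاخبار", "انا مش عارف", "ما بعرف شو بدي"]
labels = ["EGY", "LEV", "EGY", "LEV"]

# Character n-grams are a common feature choice for dialect identification,
# since dialects differ in short morphological and orthographic cues.
clf = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 4))),
    ("svm", LinearSVC()),
])
clf.fit(texts, labels)
print(clf.predict(["شو عم تعمل"]))  # likely LEV, given the shared character n-grams

The transformer alternatives discussed in the paper would instead fine-tune a pre-trained checkpoint (e.g. bert-base-multilingual-cased for Multilingual-BERT) with a sequence-classification head on the same labelled data.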
Anthology ID:
2022.osact-1.5
Volume:
Proceedings of the 5th Workshop on Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur'an QA and Fine-Grained Hate Speech Detection
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Hend Al-Khalifa, Tamer Elsayed, Hamdy Mubarak, Abdulmohsen Al-Thubaity, Walid Magdy, Kareem Darwish
Venue:
OSACT
Publisher:
European Language Resources Association
Pages:
41–50
URL:
https://aclanthology.org/2022.osact-1.5
Cite (ACL):
Kathrein Abu Kwaik, Stergios Chatzikyriakidis, and Simon Dobnik. 2022. Pre-trained Models or Feature Engineering: The Case of Dialectal Arabic. In Proceedings of the 5th Workshop on Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur'an QA and Fine-Grained Hate Speech Detection, pages 41–50, Marseille, France. European Language Resources Association.
Cite (Informal):
Pre-trained Models or Feature Engineering: The Case of Dialectal Arabic (Abu Kwaik et al., OSACT 2022)
PDF:
https://preview.aclanthology.org/naacl-24-ws-corrections/2022.osact-1.5.pdf
Data
ASTD