TArC: Incrementally and Semi-Automatically Collecting a Tunisian Arabish Corpus

Elisa Gugliotta, Marco Dinarelli


Abstract
This article describes the construction of the first morpho-syntactically annotated Tunisian Arabish Corpus (TArC). Arabish, also known as Arabizi, is a spontaneous encoding of Arabic dialects in Latin characters and “arithmographs” (numbers used as letters). This code-system was developed by Arabic-speaking social media users to make writing easier in informal settings such as Computer-Mediated Communication (CMC) and text messaging. Arabish differs from one Arabic dialect to another, and each Arabish code-system is under-resourced, as are most Arabic dialects. In the last few years, NLP research has paid considerably more attention to Arabic dialects. Against this background, TArC will be a useful resource for different types of analyses, both computational and linguistic, as well as for training NLP tools. In this article we describe preliminary work on the semi-automatic construction of TArC and some of the first analyses we carried out on it. In addition, to give a complete overview of the challenges faced during the building process, we present the main characteristics of the Tunisian dialect and its encoding in Tunisian Arabish.
Anthology ID:
2020.lrec-1.770
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
6279–6286
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.770
Cite (ACL):
Elisa Gugliotta and Marco Dinarelli. 2020. TArC: Incrementally and Semi-Automatically Collecting a Tunisian Arabish Corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 6279–6286, Marseille, France. European Language Resources Association.
Cite (Informal):
TArC: Incrementally and Semi-Automatically Collecting a Tunisian Arabish Corpus (Gugliotta & Dinarelli, LREC 2020)
PDF:
https://preview.aclanthology.org/emnlp22-frontmatter/2020.lrec-1.770.pdf
Data
TArC