Saméh Kchaou


2020

Parallel resources for Tunisian Arabic Dialect Translation
Saméh Kchaou | Rahma Boujelbane | Lamia Hadrich-Belguith
Proceedings of the Fifth Arabic Natural Language Processing Workshop

The difficulty of processing dialects is clearly visible in the high cost of building a representative corpus, in particular for machine translation. Indeed, all machine translation systems require a large amount of well-managed training data, which is a challenge in a low-resource setting such as the Tunisian Arabic dialect. In this paper, we present a data augmentation technique to create a parallel corpus of the Tunisian Arabic dialect as written on social media and Standard Arabic, in order to build a Machine Translation (MT) model. The created corpus was used to build a sentence-based translation model. This model reached a BLEU score of 15.03% on a test set, whereas it was limited to 13.27% when using the corpus without augmentation.

Text and Speech-based Tunisian Arabic Sub-Dialects Identification
Najla Ben Abdallah | Saméh Kchaou | Fethi Bougares
Proceedings of the 12th Language Resources and Evaluation Conference

Dialect IDentification (DID) is a challenging task, and it becomes more complicated when the dialects to be identified belong to the same country. Indeed, dialects of the same country are closely related and exhibit significant overlap at the phonetic and lexical levels. In this paper, we present our first results on a dialect classification task covering four sub-dialects spoken in Tunisia. We use the term ‘sub-dialect’ to refer to dialects belonging to the same country. Our experiments aim to discriminate between Tunisian sub-dialects from four different cities: Tunis, Sfax, Sousse and Tataouine. A spoken corpus of 1673 utterances was collected, transcribed and freely distributed. We used this corpus to build several speech- and text-based DID systems. Our results confirm that, at this level of granularity, dialects are much easier to distinguish using the speech modality. Indeed, we reached an F-1 score of 93.75% using our best speech-based identification system, while the F-1 score is limited to 54.16% using text-based DID on the same test set.

2019

LIUM-MIRACL Participation in the MADAR Arabic Dialect Identification Shared Task
Saméh Kchaou | Fethi Bougares | Lamia Hadrich-Belguith
Proceedings of the Fourth Arabic Natural Language Processing Workshop

This paper describes the joint participation of the LIUM and MIRACL Laboratories in the Arabic dialect identification challenge of the MADAR Shared Task (Bouamor et al., 2019), conducted during the Fourth Arabic Natural Language Processing Workshop (WANLP 2019). We participated in the Travel Domain Dialect Identification subtask. We built several systems and explored different techniques, including conventional machine learning methods and deep learning algorithms. Deep learning approaches did not perform well on this task. We experimented with several classification systems and were able to identify the dialect of an input sentence with an F1-score of 65.41% on the official test set, using only the training data supplied by the shared task organizers.