Mariem Ellouze Khemekhem

Also published as: Mariem Ellouze, Mariem Ellouze Khemakhem, Mariem Ellouze Khmekhem, Mariem Ellouze khemekhem

2019

Semantic Language Model for Tunisian Dialect
Abir Masmoudi | Rim Laatar | Mariem Ellouze | Lamia Hadrich Belguith
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

In this paper, we describe the process of creating a statistical Language Model (LM) for the Tunisian Dialect. This work is part of the development of an Automatic Speech Recognition (ASR) system for the Tunisian Railway Transport Network. Since our field of work is limited, there are several words with similar behaviors (semantic, for example) that do not have the same probability of appearance; they can therefore be grouped into classes. For these reasons, we propose to build an n-class LM that is based mainly on the integration of purely semantic data. Each class represents an abstraction of similar labels. In order to improve the sequence labeling task, we propose to use a discriminative algorithm based on the Conditional Random Field (CRF) model. To better judge our choice of creating an n-class word model, we compared the created model with a 3-gram model on the same test corpus. Additionally, to assess the impact of using the CRF model to perform the semantic labelling task in order to construct semantic classes, we compared the n-class model built with CRF-based semantic labelling against the n-class model built without it. The comparison shows that the n-class model obtained by applying the CRF model in the semantic labelling has better predictive power than the other two models, which present higher perplexity values.
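As an illustration of the kind of comparison the abstract describes, the following is a minimal sketch of a class-based bigram model and its perplexity. It is not the authors' implementation: the factorisation P(w_i | w_{i-1}) = P(c_i | c_{i-1}) · P(w_i | c_i), the add-one smoothing, the function names, and the toy vocabulary are all assumptions made for the example.

```python
import math
from collections import Counter

def train_class_bigram(sentences, word2class):
    """Collect counts for P(class | previous class) and P(word | class)."""
    class_bigrams, class_context = Counter(), Counter()
    word_counts, class_counts = Counter(), Counter()
    for sent in sentences:
        classes = ["<s>"] + [word2class[w] for w in sent]
        for prev, cur, word in zip(classes, classes[1:], sent):
            class_bigrams[(prev, cur)] += 1
            class_context[prev] += 1
            word_counts[(word, cur)] += 1
            class_counts[cur] += 1
    return class_bigrams, class_context, word_counts, class_counts

def perplexity(sentences, word2class, model):
    """Perplexity under the class bigram model with add-one smoothing."""
    class_bigrams, class_context, word_counts, class_counts = model
    n_classes = len(set(word2class.values()))
    class_sizes = Counter(word2class.values())  # number of words per class
    log_prob, n_tokens = 0.0, 0
    for sent in sentences:
        prev = "<s>"
        for word in sent:
            cur = word2class[word]
            p_class = (class_bigrams[(prev, cur)] + 1) / (class_context[prev] + n_classes)
            p_word = (word_counts[(word, cur)] + 1) / (class_counts[cur] + class_sizes[cur])
            log_prob += math.log2(p_class * p_word)
            n_tokens += 1
            prev = cur
    return 2 ** (-log_prob / n_tokens)

# Hypothetical toy data: two city names share one semantic class.
word2class = {"ticket": "OBJECT", "to": "PREP", "tunis": "CITY", "sfax": "CITY"}
train = [["ticket", "to", "tunis"]]
model = train_class_bigram(train, word2class)
# "sfax" was never seen in training, but it inherits the behaviour of its class.
print(perplexity([["ticket", "to", "sfax"]], word2class, model))
```

A lower perplexity on the same test corpus indicates a better predictive model, which is the sense in which two language models are usually compared.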

Automatic diacritization of Tunisian dialect text using Recurrent Neural Network
Abir Masmoudi | Mariem Ellouze | Lamia Hadrich Belguith
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

The absence of diacritical marks in Arabic texts generally leads to morphological, syntactic and semantic ambiguities. This is even more pronounced for under-resourced languages such as the Tunisian dialect, which suffers from the unavailability of basic tools and linguistic resources, such as a sufficient amount of corpora, multilingual dictionaries, and morphological and syntactic analyzers. Processing this language therefore faces greater challenges. The automatic diacritization of Modern Standard Arabic (MSA) text is one of the various complex problems that can be solved by deep neural networks today. Since the Tunisian dialect is an under-resourced variety related to MSA and the two share many similarities, we propose to investigate a recurrent neural network (RNN) for the dialect diacritization problem. This model is compared to our previous CRF and SMT models (CITATION) on the same dialect corpus. We show experimentally that our model achieves better results (DER of 10.72%) than the CRF model (DER of 20.25%) and the SMT model (DER of 33.15%).
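The abstract reports results as a diacritic error rate (DER) but does not spell out how it is computed. A minimal sketch, assuming the common definition (the percentage of characters whose predicted diacritic differs from the reference) and assuming one label per character with the two sequences already aligned, might look like this. The function name and label strings are hypothetical.

```python
def diacritic_error_rate(reference, hypothesis):
    """Percentage of characters whose predicted diacritic label is wrong.

    Both arguments are lists with one diacritic label per character;
    an empty string stands for "no diacritic" (an assumption of this sketch).
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")
    if not reference:
        raise ValueError("cannot compute DER on an empty sequence")
    errors = sum(ref != hyp for ref, hyp in zip(reference, hypothesis))
    return 100.0 * errors / len(reference)

# Hypothetical labels for a four-character word: one of four is wrong.
reference = ["fatha", "kasra", "", "damma"]
hypothesis = ["fatha", "damma", "", "damma"]
print(diacritic_error_rate(reference, hypothesis))  # 25.0
```

Published evaluations differ on details such as whether undiacritized characters or word-final case endings are counted, so figures computed this way are not necessarily comparable to those in the abstract.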

2018

Détection des couples de termes translittérés à partir d’un corpus parallèle anglais-arabe (Detection of transliterated term pairs from an English-Arabic parallel corpus)
Wafa Neifar | Thierry Hamon | Pierre Zweigenbaum | Mariem Ellouze | Lamia-Hadrich Belguith
Actes de la Conférence TALN. Volume 1 - Articles longs, articles courts de TALN

2016

Impact de l’agglutination dans l’extraction de termes en arabe standard moderne (Adaptation of a term extractor to the Modern Standard Arabic language)
Wafa Neifar | Thierry Hamon | Pierre Zweigenbaum | Mariem Ellouze | Lamia Hadrich Belguith
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Posters)

In this article, we present an adaptation to Modern Standard Arabic of a term extractor designed for French and English. The adaptation first consisted of describing the term extraction process in a manner similar to that defined for English and French, taking into account certain morpho-syntactic particularities of the Arabic language. We then considered the phenomenon of agglutination in Arabic. The evaluation was carried out on a corpus of medical texts. The results show that, among 400 maximal candidate terms analysed, 288 are judged correct with respect to the domain (72.1%). Extraction errors are due to part-of-speech tagging and to the lack of vowelization of the texts, but also to agglutination phenomena.

2014

Tunisian Arabic is a dialect of the Arabic language spoken in Tunisia. Tunisian Arabic is an under-resourced language. It has neither a standard orthography nor large collections of written text and dictionaries. Actually, there is no strict separation between Modern Standard Arabic, the official language of the government, media and education, and Tunisian Arabic; the two exist on a continuum dominated by mixed forms. In this paper, we present a conventional orthography for Tunisian Arabic, following a previous effort on developing a conventional orthography for Dialectal Arabic (or CODA) demonstrated for Egyptian Arabic. We explain the design principles of CODA and provide a detailed description of its guidelines as applied to Tunisian Arabic.

A Corpus and Phonetic Dictionary for Tunisian Arabic Speech Recognition
Abir Masmoudi | Mariem Ellouze Khmekhem | Yannick Estève | Lamia Hadrich Belguith | Nizar Habash
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper we describe an effort to create a corpus and phonetic dictionary for Tunisian Arabic Automatic Speech Recognition (ASR). The corpus, named TARIC (Tunisian Arabic Railway Interaction Corpus) has a collection of audio recordings and transcriptions from dialogues in the Tunisian Railway Transport Network. The phonetic (or pronunciation) dictionary is an important ASR component that serves as an intermediary between acoustic models and language models in ASR systems. The method proposed in this paper, to automatically generate a phonetic dictionary, is rule based. For that reason, we define a set of pronunciation rules and a lexicon of exceptions. To determine the performance of our phonetic rules, we chose to evaluate our pronunciation dictionary on two types of corpora. The word error rate of word grapheme-to-phoneme mapping is around 9%.
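The rule-based approach described above, pronunciation rules plus a lexicon of exceptions, can be sketched as follows. This is only an illustrative outline: the rules, the exception entry, the Latin transliteration, and the phoneme symbols are hypothetical placeholders, not the rule set used for TARIC.

```python
# Hypothetical exception lexicon: words whose pronunciation the rules cannot derive.
EXCEPTIONS = {"tgv": ["t", "e", "Z", "e", "v", "e"]}

# Hypothetical grapheme-to-phoneme rules, longest grapheme first.
RULES = [("sh", "S"), ("th", "T"), ("kh", "x")]

def to_phonemes(word):
    """Map a word to a phoneme list: exceptions first, then rules, then identity."""
    if word in EXCEPTIONS:
        return list(EXCEPTIONS[word])
    phonemes, i = [], 0
    while i < len(word):
        for grapheme, phoneme in RULES:
            if word.startswith(grapheme, i):
                phonemes.append(phoneme)
                i += len(grapheme)
                break
        else:
            # No rule matched: fall back to the letter itself.
            phonemes.append(word[i])
            i += 1
    return phonemes

print(to_phonemes("shams"))  # ['S', 'a', 'm', 's']
print(to_phonemes("tgv"))    # ['t', 'e', 'Z', 'e', 'v', 'e']
```

Checking the exception lexicon before applying the rules keeps irregular words, such as loanwords and abbreviations, from being mispronounced by otherwise general rules.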

De l’arabe standard vers l’arabe dialectal : projection de corpus et ressources linguistiques en vue du traitement automatique de l’oral dans les médias tunisiens [From Modern Standard Arabic to Tunisian dialect: corpus projection and linguistic resources towards the automatic processing of speech in the Tunisian media]
Rahma Boujelbane | Mariem Ellouze | Frédéric Béchet | Lamia Belguith
Traitement Automatique des Langues, Volume 55, Numéro 2 : Traitement automatique du langage parlé [Spoken language processing]