Houda Bouamor

2021

An Exploration of Automatic Text Summarization of Financial Reports
Samir Abdaljalil | Houda Bouamor
Proceedings of the Third Workshop on Financial Technology and Natural Language Processing

The Interplay of Variant, Size, and Task Type in Arabic Pre-trained Language Models
Go Inoue | Bashar Alhafni | Nurpeiis Baimukan | Houda Bouamor | Nizar Habash
Proceedings of the Sixth Arabic Natural Language Processing Workshop

In this paper, we explore the effects of language variants, data sizes, and fine-tuning task types in Arabic pre-trained language models. To do so, we build three pre-trained language models across three variants of Arabic: Modern Standard Arabic (MSA), dialectal Arabic, and classical Arabic, in addition to a fourth language model which is pre-trained on a mix of the three. We also examine the importance of pre-training data size by building additional models that are pre-trained on a scaled-down set of the MSA variant. We compare our different models to each other, as well as to eight publicly available models by fine-tuning them on five NLP tasks spanning 12 datasets. Our results suggest that the variant proximity of pre-training data to fine-tuning data is more important than the pre-training data size. We exploit this insight in defining an optimized system selection model for the studied tasks.

NADI 2021: The Second Nuanced Arabic Dialect Identification Shared Task
Muhammad Abdul-Mageed | Chiyu Zhang | AbdelRahim Elmadany | Houda Bouamor | Nizar Habash
Proceedings of the Sixth Arabic Natural Language Processing Workshop

We present the findings and results of the Second Nuanced Arabic Dialect Identification Shared Task (NADI 2021). This Shared Task includes four subtasks: country-level Modern Standard Arabic (MSA) identification (Subtask 1.1), country-level dialect identification (Subtask 1.2), province-level MSA identification (Subtask 2.1), and province-level sub-dialect identification (Subtask 2.2). The shared task dataset covers a total of 100 provinces from 21 Arab countries, collected from the Twitter domain. A total of 53 teams from 23 countries registered to participate in the tasks, thus reflecting the interest of the community in this area. We received 16 submissions for Subtask 1.1 from five teams, 27 submissions for Subtask 1.2 from eight teams, 12 submissions for Subtask 2.1 from four teams, and 13 submissions for Subtask 2.2 from four teams.

2020

NADI 2020: The First Nuanced Arabic Dialect Identification Shared Task
Muhammad Abdul-Mageed | Chiyu Zhang | Houda Bouamor | Nizar Habash
Proceedings of the Fifth Arabic Natural Language Processing Workshop

We present the results and findings of the First Nuanced Arabic Dialect Identification Shared Task (NADI). This Shared Task includes two subtasks: country-level dialect identification (Subtask 1) and province-level sub-dialect identification (Subtask 2). The data for the shared task covers a total of 100 provinces from 21 Arab countries and is collected from the Twitter domain. As such, NADI is the first shared task to target naturally-occurring fine-grained dialectal text at the sub-country level. A total of 61 teams from 25 countries registered to participate in the tasks, thus reflecting the interest of the community in this area. We received 47 submissions for Subtask 1 from 18 teams and 9 submissions for Subtask 2 from 9 teams.

Gender-Aware Reinflection using Linguistically Enhanced Neural Models
Bashar Alhafni | Nizar Habash | Houda Bouamor
Proceedings of the Second Workshop on Gender Bias in Natural Language Processing

In this paper, we present an approach for sentence-level gender reinflection using linguistically enhanced sequence-to-sequence models. Our system takes an Arabic sentence and a given target gender as input and generates a gender-reinflected sentence based on the target gender. We formulate the problem as a user-aware grammatical error correction task and build an encoder-decoder architecture to jointly model reinflection for both masculine and feminine grammatical genders. We also show that adding linguistic features to our model leads to better reinflection results. The results on a blind test set using our best system show improvements over previous work, with a 3.6% absolute increase in M2 F0.5.

A Spelling Correction Corpus for Multiple Arabic Dialects
Fadhl Eryani | Nizar Habash | Houda Bouamor | Salam Khalifa
Proceedings of the 12th Language Resources and Evaluation Conference

Arabic dialects are the non-standard varieties of Arabic commonly spoken – and increasingly written on social media – across the Arab world. Arabic dialects do not have standard orthographies, a challenge for natural language processing applications. In this paper, we present the MADAR CODA Corpus, a collection of 10,000 sentences from five Arabic city dialects (Beirut, Cairo, Doha, Rabat, and Tunis) represented in the Conventional Orthography for Dialectal Arabic (CODA) in parallel with their raw original form. The sentences come from the Multi-Arabic Dialect Applications and Resources (MADAR) Project and are in parallel across the cities (2,000 sentences from each city). This publicly available resource is intended to support research on spelling correction and text normalization for Arabic dialects. We present results on a bootstrapping technique we use to speed up the CODA annotation, as well as on the degree of similarity across the dialects before and after CODA annotation.

2019

Automatic Gender Identification and Reinflection in Arabic
Nizar Habash | Houda Bouamor | Christine Chung
Proceedings of the First Workshop on Gender Bias in Natural Language Processing

The impressive progress in many Natural Language Processing (NLP) applications has increased the awareness of some of the biases these NLP systems have with regard to gender identities. In this paper, we propose an approach to extend biased single-output gender-blind NLP systems with gender-specific alternative reinflections. We focus on Arabic, a gender-marking morphologically rich language, in the context of machine translation (MT) from English, and for first-person-singular constructions only. Our contributions are the development of a system-independent gender-awareness wrapper, and the building of a corpus for training and evaluating first-person-singular gender identification and reinflection in Arabic. Our results successfully demonstrate the viability of this approach with an 8% relative increase in BLEU score for first-person-singular feminine, and a comparable 5.3% increase for first-person-singular masculine, on top of a state-of-the-art gender-blind MT system on a held-out test set.

A Little Linguistics Goes a Long Way: Unsupervised Segmentation with Limited Language Specific Guidance
Alexander Erdmann | Salam Khalifa | Mai Oudah | Nizar Habash | Houda Bouamor
Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology

We present de-lexical segmentation, a linguistically motivated alternative to greedy or other unsupervised methods, requiring only minimal language-specific input. Our technique involves creating a small grammar of closed-class affixes, which can be written in a few hours. The grammar overgenerates analyses for word forms attested in a raw corpus, which are then disambiguated based on features of the linguistic base proposed for each form. Extending the grammar to cover orthographic, morpho-syntactic or lexical variation is simple, making it an ideal solution for challenging corpora with noisy, dialect-inconsistent, or otherwise non-standard content. In two evaluations, we consistently outperform competitive unsupervised baselines and approach the performance of state-of-the-art supervised models trained on large amounts of data, providing evidence for the value of linguistic input during preprocessing.

The MADAR Shared Task on Arabic Fine-Grained Dialect Identification
Houda Bouamor | Sabit Hassan | Nizar Habash
Proceedings of the Fourth Arabic Natural Language Processing Workshop

In this paper, we present the results and findings of the MADAR Shared Task on Arabic Fine-Grained Dialect Identification. This shared task was organized as part of The Fourth Arabic Natural Language Processing Workshop, collocated with ACL 2019. The shared task includes two subtasks: the MADAR Travel Domain Dialect Identification subtask (Subtask 1) and the MADAR Twitter User Dialect Identification subtask (Subtask 2). This shared task is the first to target a large set of dialect labels at the city and country levels. The data for the shared task was created or collected under the Multi-Arabic Dialect Applications and Resources (MADAR) project. A total of 21 teams from 15 countries participated in the shared task.

The FinSBD-2019 Shared Task: Sentence Boundary Detection in PDF Noisy Text in the Financial Domain
Abderrahim Ait Azzi | Houda Bouamor | Sira Ferradans
Proceedings of the First Workshop on Financial Technology and Natural Language Processing

Proceedings of the Second Financial Narrative Processing Workshop (FNP 2019)
Mahmoud El-Haj | Paul Rayson | Steven Young | Houda Bouamor | Sira Ferradans
Proceedings of the Second Financial Narrative Processing Workshop (FNP 2019)

ADIDA: Automatic Dialect Identification for Arabic
Ossama Obeid | Mohammad Salameh | Houda Bouamor | Nizar Habash
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations)

This demo paper describes ADIDA, a web-based system for automatic dialect identification for Arabic text. The system distinguishes among the dialects of 25 Arab cities (from Rabat to Muscat) in addition to Modern Standard Arabic. The results are presented with either a point map or a heat map visualizing the automatic identification probabilities over a geographical map of the Arab World.

2018

Fine-Grained Arabic Dialect Identification
Mohammad Salameh | Houda Bouamor | Nizar Habash
Proceedings of the 27th International Conference on Computational Linguistics

Previous work on the problem of Arabic Dialect Identification typically targeted coarse-grained five dialect classes plus Standard Arabic (6-way classification). This paper presents the first results on a fine-grained dialect classification task covering 25 specific cities from across the Arab World, in addition to Standard Arabic – a very challenging task. We build several classification systems and explore a large space of features. Our results show that we can identify the exact city of a speaker at an accuracy of 67.9% for sentences with an average length of 7 words (a 9% relative error reduction over the state-of-the-art technique for Arabic dialect identification) and reach more than 90% when we consider 16 words. We also report on additional insights from a data analysis of similarity and difference across Arabic dialects.

MADARi: A Web Interface for Joint Arabic Morphological Annotation and Spelling Correction
Ossama Obeid | Salam Khalifa | Nizar Habash | Houda Bouamor | Wajdi Zaghouani | Kemal Oflazer
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

2016

Eyes Don’t Lie: Predicting Machine Translation Quality Using Eye Movement
Hassan Sajjad | Francisco Guzmán | Nadir Durrani | Ahmed Abdelali | Houda Bouamor | Irina Temnikova | Stephan Vogel
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Arabic writing is typically underspecified for short vowels and other markups, referred to as diacritics. In addition to the lexical ambiguity exhibited in most languages, the lack of diacritics in written Arabic adds another layer of ambiguity which is an artifact of the orthography. In this paper, we present the details of three annotation experimental conditions designed to study the impact of automatic ambiguity detection on annotation speed and quality in a large-scale annotation project.

Machine Translation Evaluation for Arabic using Morphologically-enriched Embeddings
Francisco Guzmán | Houda Bouamor | Ramy Baly | Nizar Habash
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Evaluation of machine translation (MT) into morphologically rich languages (MRL) has not been well studied despite posing many challenges. In this paper, we explore the use of embeddings obtained from different levels of lexical and morpho-syntactic linguistic analysis and show that they improve MT evaluation into an MRL. Specifically, we report on Arabic, a language with complex and rich morphology. Our results show that using a neural-network model with different input representations produces results that clearly outperform the state of the art for MT evaluation into Arabic, with a roughly 75% increase in correlation with human judgments on the pairwise MT evaluation quality task. More importantly, we demonstrate the usefulness of morpho-syntactic representations to model sentence similarity for MT evaluation and address complex linguistic phenomena of Arabic.

DALILA: The Dialectal Arabic Linguistic Learning Assistant
Salam Khalifa | Houda Bouamor | Nizar Habash
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Dialectal Arabic (DA) poses serious challenges for Natural Language Processing (NLP). The number and sophistication of tools and datasets in DA are very limited in comparison to Modern Standard Arabic (MSA) and other languages. MSA tools do not effectively model DA, which makes the direct use of MSA NLP tools for handling dialects impractical. This is a particular challenge for the creation of tools to support learning Arabic as a living language on the web, where authentic material can be found in both MSA and DA. In this paper, we present the Dialectal Arabic Linguistic Learning Assistant (DALILA), a Chrome extension that utilizes cutting-edge Arabic dialect NLP research to assist learners and non-native speakers in understanding text written in either MSA or DA. DALILA provides dialectal word analysis and an English gloss corresponding to each word.

Building an Arabic Machine Translation Post-Edited Corpus: Guidelines and Annotation
Wajdi Zaghouani | Nizar Habash | Ossama Obeid | Behrang Mohit | Houda Bouamor | Kemal Oflazer
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present our guidelines and annotation procedure to create a human-corrected, post-edited machine translation corpus for Modern Standard Arabic. Our overarching goal is to use the annotated corpus to develop automatic machine translation post-editing systems for Arabic that can be used to help accelerate the human revision process of translated texts. The creation of any manually annotated corpus usually presents many challenges. In order to address these challenges, we created comprehensive and simplified annotation guidelines which were used by a team of five annotators and one lead annotator. In order to ensure high agreement between the annotators, multiple training sessions were held and inter-annotator agreement was measured regularly to check the annotation quality. The created corpus of manually post-edited translations of English to Arabic articles is the largest to date for this language pair.

This paper presents the annotation guidelines developed as part of an effort to create a large scale manually diacritized corpus for various Arabic text genres. The target size of the annotated corpus is 2 million words. We summarize the guidelines and describe issues encountered during the training of the annotators. We also discuss the challenges posed by the complexity of the Arabic language and how they are addressed. Finally, we present the diacritization annotation procedure and detail the quality of the resulting annotations.

2015

QCMUQ@QALB-2015 Shared Task: Combining Character level MT and Error-tolerant Finite-State Recognition for Arabic Spelling Correction
Houda Bouamor | Hassan Sajjad | Nadir Durrani | Kemal Oflazer
Proceedings of the Second Workshop on Arabic Natural Language Processing

UMMU@QALB-2015 Shared Task: Character and Word level SMT pipeline for Automatic Error Correction of Arabic Text
Fethi Bougares | Houda Bouamor
Proceedings of the Second Workshop on Arabic Natural Language Processing

2014

A Multidialectal Parallel Corpus of Arabic
Houda Bouamor | Nizar Habash | Kemal Oflazer
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The daily spoken variety of Arabic is often termed the colloquial or dialect form of Arabic. There are many Arabic dialects across the Arab World and within other Arabic-speaking communities. These dialects vary widely from region to region and to a lesser extent from city to city in each region. The dialects are not standardized, they are not taught, and they do not have official status. However, they are the primary vehicles of communication (face-to-face and recently, online) and have a large presence in the arts as well. In this paper, we present the first multidialectal Arabic parallel corpus, a collection of 2,000 sentences in Standard Arabic, Egyptian, Tunisian, Jordanian, Palestinian and Syrian Arabic, in addition to English. Such parallel data does not exist naturally, which makes this corpus a very valuable resource that has many potential applications such as Arabic dialect identification and machine translation.

YouDACC: the Youtube Dialectal Arabic Comment Corpus
Ahmed Salama | Houda Bouamor | Behrang Mohit | Kemal Oflazer
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents YOUDACC, an automatically annotated large-scale multi-dialectal Arabic corpus collected from user comments on YouTube videos. Our corpus covers different groups of dialects: Egyptian (EG), Gulf (GU), Iraqi (IQ), Maghrebi (MG) and Levantine (LV). We perform an empirical analysis on the crawled corpus and demonstrate that our proposed location-based method is effective for the task of dialect labeling.

CMUQ@QALB-2014: An SMT-based System for Automatic Arabic Error Correction
Serena Jeblee | Houda Bouamor | Wajdi Zaghouani | Kemal Oflazer
Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP)

CMUQ-Hybrid: Sentiment Classification By Feature Engineering and Parameter Tuning
Kamla Al-Mannai | Hanan Alshikhabobakr | Sabih Bin Wasi | Rukhsar Neyaz | Houda Bouamor | Behrang Mohit
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

CMUQ@Qatar: Using Rich Lexical Features for Sentiment Analysis on Twitter
Sabih Bin Wasi | Rukhsar Neyaz | Houda Bouamor | Behrang Mohit
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

A Human Judgement Corpus and a Metric for Arabic MT Evaluation
Houda Bouamor | Hanan Alshikhabobakr | Behrang Mohit | Kemal Oflazer
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

2013

Dudley North visits North London: Learning When to Transliterate to Arabic
Mahmoud Azab | Houda Bouamor | Behrang Mohit | Kemal Oflazer
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

SuMT: A Framework of Summarization and MT
Houda Bouamor | Behrang Mohit | Kemal Oflazer
Proceedings of the Sixth International Joint Conference on Natural Language Processing

2012

Validation sur le Web de reformulations locales: application à la Wikipédia (Assisted Rephrasing for Wikipedia Contributors through Web-based Validation) [in French]
Houda Bouamor | Aurélien Max | Gabriel Illouz | Anne Vilnat
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 2: TALN

Une étude en 3D de la paraphrase: types de corpus, langues et techniques (A Study of Paraphrase along 3 Dimensions : Corpus Types, Languages and Techniques) [in French]
Houda Bouamor | Aurélien Max | Anne Vilnat
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 2: TALN

Generalizing Sub-sentential Paraphrase Acquisition across Original Signal Type of Text Pairs
Aurélien Max | Houda Bouamor | Anne Vilnat
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

Validation of sub-sentential paraphrases acquired from parallel monolingual corpora
Houda Bouamor | Aurélien Max | Anne Vilnat
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

A contrastive review of paraphrase acquisition techniques
Houda Bouamor | Aurélien Max | Gabriel Illouz | Anne Vilnat
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper addresses the issue of what approach should be used for building a corpus of sentential paraphrases depending on one's requirements. Six strategies are studied: (1) multiple translations into a single language from another language; (2) multiple translations into a single language from different other languages; (3) multiple descriptions of short videos; (4) multiple subtitles for the same language; (5) headlines for similar news articles; and (6) sub-sentential paraphrasing in the context of a Web-based game. We report results on French for 50 paraphrase pairs collected for all these strategies, where corpora were manually aligned at the finest possible level to define oracle performance in terms of accessible sub-sentential paraphrases. The differences observed will be used as criteria for motivating the choice of a given approach before attempting to build a new paraphrase corpus.

2011

Monolingual Alignment by Edit Rate Computation on Sentential Paraphrase Pairs
Houda Bouamor | Aurélien Max | Anne Vilnat
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

Paraphrases et modifications locales dans l’historique des révisions de Wikipédia (Paraphrases and local changes in the revision history of Wikipedia)
Camille Dutrey | Houda Bouamor | Delphine Bernhard | Aurélien Max
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

In this article, we analyze the local modifications available in the revision history of the French version of Wikipedia. We first define a typology of modifications based on a detailed study of a large corpus of modifications. We then describe the manual annotation of part of this corpus in order to assess how complex the task of automatically identifying paraphrases is in this kind of corpus. Finally, we evaluate a rule-based paraphrase identification tool on a subset of our corpus.

Combinaison d’informations pour l’alignement monolingue (Information combination for monolingual alignment)
Houda Bouamor | Aurélien Max | Anne Vilnat
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

In this article, we describe a new method for automatically aligning sentential paraphrases. We use previously developed methods to produce different hybrid approaches (hybridizations). These methods make it possible to acquire textual equivalences from a monolingual parallel corpus. The hybridization combines information obtained by various techniques: statistical alignments, a symbolic approach, syntactic tree fusion, and alignment based on edit distances. We evaluated all of these results and observe an improvement in sub-sentential paraphrase acquisition.

Web-based Validation for Contextual Targeted Paraphrasing
Houda Bouamor | Aurélien Max | Gabriel Illouz | Anne Vilnat
Proceedings of the Workshop on Monolingual Text-To-Text Generation

2010

Acquisition de paraphrases sous-phrastiques depuis des paraphrases d’énoncés (Acquisition of sub-sentential paraphrases from sentential paraphrases)
Houda Bouamor | Aurélien Max | Anne Vilnat
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

In this article, we present the task of acquiring sub-sentential paraphrases (involving pairs of words or word groups), and describe several techniques operating at different levels. We describe an evaluation that compares these techniques and their combinations on two corpora of sentential paraphrases obtained through multiple translation. The conclusions we draw can serve as a guide for improving existing techniques.

Construction d’un corpus de paraphrases d’énoncés par traduction multiple multilingue (Building a corpus of sentential paraphrases through multilingual multiple translation)
Houda Bouamor
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. REncontres jeunes Chercheurs en Informatique pour le Traitement Automatique des Langues

Large-scale paraphrase corpora are important in many NLP applications. In this article, we present a method for obtaining a parallel corpus of sentential paraphrases in French. It aims to collect multiple translations proposed by French-speaking volunteer contributors from several European languages. We hypothesize that two translations submitted independently by two participants generally preserve the meaning of the original sentence, regardless of the language from which the translation is made. Analysis of the results allows us to discuss this hypothesis.

2009

Amener des utilisateurs à créer et évaluer des paraphrases par le jeu (Getting users to create and evaluate paraphrases through a game)
Houda Bouamor | Aurélien Max | Anne Vilnat
Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Démonstrations

In this article, we present a web application for acquiring sentential and sub-sentential paraphrases in the form of a game. The application allows the acquisition of both paraphrases and multiple human judgments on these paraphrases, which constitute particularly useful data for NLP applications based on paraphrastic phenomena.