2024
Arabic Diacritization Using Morphologically Informed Character-Level Model
Muhammad Morsy Elmallah
|
Mahmoud Reda
|
Kareem Darwish
|
Abdelrahman El-Sheikh
|
Ashraf Hatim Elneima
|
Murtadha Aljubran
|
Nouf Alsaeed
|
Reem Mohammed
|
Mohamed Al-Badrashiny
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Arabic diacritic recovery, i.e., diacritization, is necessary for proper vocalization and is an enabler for downstream applications such as language learning and text-to-speech. Diacritics come in two varieties: core-word diacritics and case endings. In this paper we introduce a highly effective morphologically informed character-level model that can recover both types of diacritics simultaneously. The model uses a Recurrent Neural Network (RNN) based architecture that takes in text as a sequence of characters, with markers for morphological segmentation, and outputs a sequence of diacritics. We also introduce a character-based morphological segmentation model that we train for Modern Standard Arabic (MSA) and dialectal Arabic. We demonstrate the efficacy of our diacritization model on Classical Arabic, MSA, and two dialectal (Moroccan and Tunisian) texts. We achieve the lowest reported word-level diacritization error rate for MSA (3.4%), match the best results for Classical Arabic (5.4%), and report competitive results for dialectal Arabic.
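The word-level diacritization error rate quoted above can be illustrated with a small sketch. This is not the authors' evaluation script: the function name `word_error_rate` and the exact-match scoring rule are assumptions, and the toy words use Latin placeholders rather than Arabic script.

```python
# Illustrative sketch only: the paper's exact scoring rules are not given in
# the abstract, so plain exact-match comparison per word is an assumption.
def word_error_rate(reference_words, predicted_words):
    """Return the fraction of words whose diacritized form differs from the reference."""
    if len(reference_words) != len(predicted_words):
        raise ValueError("reference and prediction must be aligned word by word")
    if not reference_words:
        return 0.0
    errors = sum(ref != hyp for ref, hyp in zip(reference_words, predicted_words))
    return errors / len(reference_words)


# Toy data with Latin placeholders; the second word has a wrong case ending.
reference = ["kataba", "alwaladu", "aldarsa"]
predicted = ["kataba", "alwalada", "aldarsa"]
wer = word_error_rate(reference, predicted)
```

Under this definition a single wrong diacritic anywhere in a word, including its case ending, counts the whole word as an error.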
2023
Evaluating Multilingual Speech Translation under Realistic Conditions with Resegmentation and Terminology
Elizabeth Salesky
|
Kareem Darwish
|
Mohamed Al-Badrashiny
|
Mona Diab
|
Jan Niehues
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)
We present the ACL 60/60 evaluation sets for multilingual translation of ACL 2022 technical presentations into 10 target languages. This dataset enables further research into multilingual speech translation under realistic recording conditions with unsegmented audio and domain-specific terminology, applying NLP tools to text and speech in the technical domain, and evaluating and improving model robustness to diverse speaker demographics.
EvolveMT: an Ensemble MT Engine Improving Itself with Usage Only
Kamer Yüksel
|
Ahmet Gunduz
|
Mohamed Al-badrashiny
|
Hassan Sawaf
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track)
This work proposes a method named EvolveMT for the efficient combination of multiple machine translation (MT) engines. The method selects the output from one engine for each segment, using online learning techniques to predict the most appropriate system for each translation request. A neural quality estimation metric supervises the method without requiring reference translations. The method’s online learning capability enables it to adapt to changes in the domain or MT engines dynamically, eliminating the requirement for retraining. The method selects a subset of translation engines to be called based on the source sentence features. The degree of exploration is configurable according to the desired quality-cost trade-off. Results from custom datasets demonstrate that EvolveMT achieves similar translation accuracy at a lower cost than selecting the best translation of each segment from all translations using an MT quality estimator. To the best of our knowledge, EvolveMT is the first MT system that adapts itself after deployment to incoming translation requests from the production environment without needing costly retraining on human feedback.
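As a rough illustration of selecting one engine per segment with online learning and a configurable degree of exploration, here is a minimal sketch. The abstract does not name the learning algorithm, so epsilon-greedy is a stand-in chosen for brevity. The class `EngineSelector`, the engine names, and the quality scores are all hypothetical, and the sketch omits the source-sentence features the paper mentions.

```python
import random


class EngineSelector:
    """Epsilon-greedy chooser over MT engines (a stand-in, not EvolveMT itself)."""

    def __init__(self, engines, epsilon=0.1, seed=0):
        self.engines = list(engines)
        self.epsilon = epsilon  # degree of exploration (quality-cost trade-off)
        self.rng = random.Random(seed)
        self.counts = {e: 0 for e in self.engines}
        self.mean_quality = {e: 0.0 for e in self.engines}

    def choose(self):
        # Explore a random engine with probability epsilon, otherwise exploit.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.engines)
        return max(self.engines, key=lambda e: self.mean_quality[e])

    def update(self, engine, quality):
        # Incremental running mean of a reference-free quality estimate.
        self.counts[engine] += 1
        n = self.counts[engine]
        self.mean_quality[engine] += (quality - self.mean_quality[engine]) / n


# Hypothetical engines and quality-estimation scores.
selector = EngineSelector(["engine_a", "engine_b"], epsilon=0.0)
selector.update("engine_a", 0.5)
selector.update("engine_b", 0.9)
best = selector.choose()
```

Because the estimates are updated after every request, the selector keeps adapting without a separate retraining step, which is the property the abstract emphasizes.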
2022
MTLens: Machine Translation Output Debugging
Shreyas Sharma
|
Kareem Darwish
|
Lucas Pavanelli
|
Thiago Castro Ferreira
|
Mohamed Al-Badrashiny
|
Kamer Ali Yuksel
|
Hassan Sawaf
Proceedings of the Thirteenth Language Resources and Evaluation Conference
The performance of Machine Translation (MT) systems varies significantly with inputs of diverging features such as topics, genres, and surface properties. Though there are many MT evaluation metrics that generally correlate with human judgments, they are not directly useful in identifying specific shortcomings of MT systems. In this demo, we present a benchmarking interface that enables improved evaluation of specific MT systems in isolation or multiple MT systems collectively by quantitatively evaluating their performance on many tasks across multiple domains and evaluation metrics. Further, it facilitates effective debugging and error analysis of MT output via the use of dynamic filters that help users hone in on problem sentences with specific properties, such as genre, topic, sentence length, etc. The interface can be extended to include additional filters such as lexical, morphological, and syntactic features. Aside from helping debug MT output, it can also help in identifying problems in reference translations and evaluation metrics.
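The idea of narrowing MT output down to problem sentences with combinable filters can be sketched as follows. This is not MTLens code: the record fields (`source`, `genre`, `score`) and the function `apply_filters` are hypothetical names used only for illustration.

```python
def apply_filters(records, filters):
    """Keep only the records that satisfy every filter predicate."""
    return [r for r in records if all(f(r) for f in filters)]


# Hypothetical per-sentence records with a property and a metric score.
records = [
    {"source": "short one", "genre": "news", "score": 0.9},
    {"source": "a much longer sentence with many tokens here", "genre": "news", "score": 0.4},
    {"source": "another long sentence from a different genre entirely", "genre": "chat", "score": 0.3},
]

# Filters are plain predicates, so new ones (lexical, morphological, ...) can be appended.
filters = [
    lambda r: r["genre"] == "news",
    lambda r: len(r["source"].split()) >= 5,
]
problem_sentences = apply_filters(records, filters)
```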
aiXplain at Arabic Hate Speech 2022: An Ensemble Based Approach to Detecting Offensive Tweets
Salaheddin Alzubi
|
Thiago Castro Ferreira
|
Lucas Pavanelli
|
Mohamed Al-Badrashiny
Proceedings of the 5th Workshop on Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur'an QA and Fine-Grained Hate Speech Detection
Abusive speech on online platforms has a detrimental effect on users’ mental health. This warrants the need for innovative solutions that automatically moderate content, especially on online platforms such as Twitter where a user’s anonymity is loosely controlled. This paper outlines aiXplain Inc.’s ensemble based approach to detecting offensive speech in the Arabic language based on OSACT5’s shared sub-task A. Additionally, this paper highlights multiple challenges that may hinder progress on detecting abusive speech and provides potential avenues and techniques that may lead to significant progress.
Gulf Arabic Diacritization: Guidelines, Initial Dataset, and Results
Nouf Alabbasi
|
Mohamed Al-Badrashiny
|
Maryam Aldahmani
|
Ahmed AlDhanhani
|
Abdullah Saleh Alhashmi
|
Fawaghy Ahmed Alhashmi
|
Khalid Al Hashemi
|
Rama Emad Alkhobbi
|
Shamma T Al Maazmi
|
Mohammed Ali Alyafeai
|
Mariam M Alzaabi
|
Mohamed Saqer Alzaabi
|
Fatma Khalid Badri
|
Kareem Darwish
|
Ehab Mansour Diab
|
Muhammad Morsy Elmallah
|
Amira Ayman Elnashar
|
Ashraf Hatim Elneima
|
MHD Tameem Kabbani
|
Nour Rabih
|
Ahmad Saad
|
Ammar Mamoun Sousou
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)
Arabic diacritic recovery is important for a variety of downstream tasks such as text-to-speech. In this paper, we introduce a new Gulf Arabic diacritization dataset composed of 19,850 words based on a subset of the Gumar corpus. We provide a comprehensive set of guidelines for diacritization to enable the diacritization of more data. We also report on diacritization results based on the new corpus using a Hidden Markov Model and character-based sequence-to-sequence models.
2017
A Layered Language Model based Hybrid Approach to Automatic Full Diacritization of Arabic
Mohamed Al-Badrashiny
|
Abdelati Hawwari
|
Mona Diab
Proceedings of the Third Arabic Natural Language Processing Workshop
In this paper we present a system for automatic Arabic text diacritization using three levels of analysis granularity in a layered back-off manner. We build and exploit diacritized language models (LMs) for each of three different levels of granularity: surface form, morphologically segmented into prefix/stem/suffix, and character level. For each of the passes, we use Viterbi search to pick the most probable diacritization per word in the input. We start with the surface form LM, followed by the morphological level, and finally we leverage the character-level LM. Our system outperforms all of the published systems evaluated against the same training and test data. It achieves a 10.87% WER for complete full diacritization including lexical and syntactic diacritization, and 3.0% WER for lexical diacritization, ignoring syntactic diacritization.
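The layered back-off order (surface form, then prefix/stem/suffix, then characters) can be sketched as below. This is a deliberately simplified stand-in: plain lookup tables replace the diacritized language models and Viterbi search used in the paper, and every name (`diacritize_word`, `toy_segment`, the tables, the Latin placeholder strings) is hypothetical.

```python
def diacritize_word(word, surface_table, segment, morph_table, char_table):
    """Back off from surface form to morphological segments to characters."""
    # Level 1: the whole surface form is known.
    if word in surface_table:
        return surface_table[word]
    # Level 2: prefix/stem/suffix segments, used only if every segment is known.
    parts = segment(word)
    if all(p in morph_table for p in parts):
        return "".join(morph_table[p] for p in parts)
    # Level 3: character by character; unknown characters pass through unchanged.
    return "".join(char_table.get(c, c) for c in word)


def toy_segment(word):
    # Hypothetical segmenter: splits off a leading "w" prefix.
    if word.startswith("w") and len(word) > 1:
        return ["w", word[1:]]
    return [word]


# Toy tables with Latin placeholders standing in for Arabic script.
surface_table = {"ktb": "kataba"}
morph_table = {"w": "wa", "ktb": "kataba"}
char_table = {"k": "ka", "t": "ta"}

level1 = diacritize_word("ktb", surface_table, toy_segment, morph_table, char_table)
level2 = diacritize_word("wktb", surface_table, toy_segment, morph_table, char_table)
level3 = diacritize_word("kt", surface_table, toy_segment, morph_table, char_table)
```

Each level is consulted only when the more specific one has no answer, which is why unseen words still receive some diacritization at the character level.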
2016
Automatic Verification and Augmentation of Multilingual Lexicons
Maryam Aminian
|
Mohamed Al-Badrashiny
|
Mona Diab
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)
We present an approach for automatic verification and augmentation of multilingual lexica. We exploit existing parallel and monolingual corpora to extract multilingual correspondents via triangulation. We demonstrate the efficacy of our approach on two publicly available resources: Tharwa, a three-way lexicon comprising Dialectal Arabic, Modern Standard Arabic and English lemmas among other information (Diab et al., 2014); and BabelNet, a multilingual thesaurus comprising over 276 languages including Arabic variant entries (Navigli and Ponzetto, 2012). Our automated approach yields an F1-score of 71.71% in generating correct multilingual correspondents against gold Tharwa, and 54.46% against gold BabelNet without any human intervention.
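Triangulation through a shared pivot language can be sketched in a few lines. This only illustrates the general idea, assuming two ready-made bilingual tables. The paper extracts its correspondents from parallel and monolingual corpora, which is not modeled here, and the function `triangulate` and all entries are hypothetical.

```python
from collections import defaultdict


def triangulate(source_to_pivot, pivot_to_target):
    """Link source entries to target entries that share a pivot-language entry."""
    result = defaultdict(set)
    for source, pivots in source_to_pivot.items():
        for pivot in pivots:
            targets = pivot_to_target.get(pivot, set())
            if targets:
                result[source].update(targets)
    return dict(result)


# Hypothetical entries: dialectal word -> MSA pivots -> English lemmas.
dialect_to_msa = {"da_word": {"msa_word1", "msa_word2"}, "da_other": {"msa_unknown"}}
msa_to_english = {"msa_word1": {"book"}, "msa_word2": {"book", "volume"}}
candidates = triangulate(dialect_to_msa, msa_to_english)
```

Candidates produced this way can then be compared against an existing lexicon entry, either to verify it or to propose an addition.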
SAMER: A Semi-Automatically Created Lexical Resource for Arabic Verbal Multiword Expressions Tokens Paradigm and their Morphosyntactic Features
Mohamed Al-Badrashiny
|
Abdelati Hawwari
|
Mahmoud Ghoneim
|
Mona Diab
Proceedings of the 12th Workshop on Asian Language Resources (ALR12)
Although MWEs are relatively fixed expressions morphologically and syntactically, several types of flexibility can be observed in them, in verbal MWEs in particular. Identifying the degree of morphological and syntactic flexibility of MWEs is very important for many lexicographic and NLP tasks. Adding MWE variants/tokens to a dictionary resource requires characterizing the flexibility among other morphosyntactic features. Carrying out the task manually faces several challenges: it is very laborious in both time and effort, and it suffers from limited coverage. The problem is exacerbated in morphologically rich languages; in Arabic, the average word could have 12 possible inflected forms. Accordingly, in this paper we introduce a semi-automatic Arabic multiword expression resource (SAMER). We propose an automated method that identifies the morphological and syntactic flexibility of Arabic Verbal Multiword Expressions (AVMWE). All observed morphological variants and syntactic pattern alternations of an AVMWE are automatically acquired using large-scale corpora. We look for three morphosyntactic aspects of AVMWE types, investigating derivational and inflectional variations and syntactic templates, namely: 1) inflectional variation (inflectional paradigm) and calculating degree of flexibility; 2) derivational productivity; and 3) identifying and classifying the different syntactic types. We build a comprehensive list of AVMWE. Every token in the AVMWE list is lemmatized and tagged with POS information. We then search Arabic Gigaword and all ATBs for all possible flexible matches. For each AVMWE type we generate: a) a statistically ranked list of MWE-lexeme inflections and syntactic pattern alternations; b) an abstract syntactic template; and c) the most frequent form. Our technique is validated using a gold annotated MWE list. The results show that the quality of the generated resource is 80.04%.
The George Washington University System for the Code-Switching Workshop Shared Task 2016
Mohamed Al-Badrashiny
|
Mona Diab
Proceedings of the Second Workshop on Computational Approaches to Code Switching
SPLIT: Smart Preprocessing (Quasi) Language Independent Tool
Mohamed Al-Badrashiny
|
Arfath Pasha
|
Mona Diab
|
Nizar Habash
|
Owen Rambow
|
Wael Salloum
|
Ramy Eskander
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Text preprocessing is an important and necessary task for all NLP applications. A simple variation in any preprocessing step may drastically affect the final results. Moreover, replicability and comparability, as far as feasible, are among the goals of our scientific enterprise; building systems that ensure consistency across our various pipelines would therefore contribute significantly to those goals. The problem has become quite pronounced as NLP tools become more and more available, yet with differing levels of specification. In this paper, we present a dynamic unified preprocessing framework and tool, SPLIT, that is highly configurable based on user requirements and serves as a preprocessing tool for several tools at once. SPLIT aims to standardize the implementations of the most important preprocessing steps by providing a unified API that can be shared among researchers to ensure complete transparency in replication. The user can select the required preprocessing tasks from a long list of preprocessing steps and can also specify the order of execution, which in turn affects the final preprocessing output.
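A small sketch shows why a user-specified execution order changes the final output. It does not reproduce SPLIT's API: the step names, the `STEPS` registry, and the `preprocess` function are hypothetical, and only Python's standard `re` module is used.

```python
import re

# Hypothetical registry of named preprocessing steps.
STEPS = {
    "lowercase": str.lower,
    "strip_punctuation": lambda text: re.sub(r"[^\w\s]", "", text),
    "split_on_hyphen": lambda text: text.replace("-", " "),
    "collapse_whitespace": lambda text: re.sub(r"\s+", " ", text).strip(),
}


def preprocess(text, step_names):
    """Apply the selected steps in exactly the order the user lists them."""
    for name in step_names:
        text = STEPS[name](text)
    return text


raw = "Well-Known  Fact!"
# Splitting on hyphens before stripping punctuation keeps two separate words.
hyphen_first = preprocess(
    raw, ["split_on_hyphen", "strip_punctuation", "lowercase", "collapse_whitespace"]
)
# Stripping punctuation first deletes the hyphen and fuses the two words.
punctuation_first = preprocess(
    raw, ["strip_punctuation", "split_on_hyphen", "lowercase", "collapse_whitespace"]
)
```

Sharing the list of step names alongside results is one way to make a preprocessing configuration reproducible by other researchers.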
Creating a Large Multi-Layered Representational Repository of Linguistic Code Switched Arabic Data
Mona Diab
|
Mahmoud Ghoneim
|
Abdelati Hawwari
|
Fahad AlGhamdi
|
Nada AlMarwani
|
Mohamed Al-Badrashiny
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
We present our effort to create a large multi-layered representational repository of linguistic code-switched Arabic data. The process involves developing clear annotation standards and guidelines, streamlining the annotation process, and implementing quality control measures. We used two main protocols for annotation: in-lab gold annotations and crowdsourced annotations. We developed a web-based annotation tool to facilitate the management of the annotation process. The current version of the repository contains a total of 886,252 tokens, each tagged with one of sixteen code-switching tags. The data exhibit code switching between Modern Standard Arabic and Egyptian Dialectal Arabic across three genres: tweets, commentaries, and discussion fora. The overall inter-annotator agreement is 93.1%.
LILI: A Simple Language Independent Approach for Language Identification
Mohamed Al-Badrashiny
|
Mona Diab
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers
We introduce a generic Language Independent Framework for Linguistic Code Switch Point Detection. The system uses character-level 5-gram and word-level unigram language models to train a conditional random fields (CRF) model for classifying input words into various languages. We test our proposed framework and compare it to the state-of-the-art published systems on standard data sets from several language pairs: English-Spanish, Nepali-English, English-Hindi, Arabizi (Arabic written in the Latin/Roman script)-English, Arabic-Engari (English written in Arabic script), Modern Standard Arabic (MSA)-Egyptian, Levantine-MSA, Gulf-MSA, one more English-Spanish, and one more MSA-EGY. The overall weighted average F-scores for these language pairs are 96.4%, 97.3%, 98.0%, 97.0%, 98.9%, 86.3%, 88.2%, 90.6%, 95.2%, and 85.0%, respectively. The results show that our approach, despite its simplicity, either outperforms or performs at comparable levels to state-of-the-art published systems.
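The character-level 5-grams mentioned above can be extracted as in the sketch below. It covers only the n-gram extraction that such language models would be built on, not the language models or the CRF. The function names and the choice to return short words whole are assumptions.

```python
def char_ngrams(word, n=5):
    """Return all character n-grams of a word; short words are returned whole."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def token_features(word):
    # Hypothetical per-token feature dictionary for a sequence labeler.
    return {"unigram": word.lower(), "char_5grams": char_ngrams(word.lower())}


grams = char_ngrams("switching")
features = token_features("Hola")
```

Character n-grams of this kind capture spelling patterns typical of a language, which helps with words that never appeared in training.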
2015
AIDA2: A Hybrid Approach for Token and Sentence Level Dialect Identification in Arabic
Mohamed Al-Badrashiny
|
Heba Elfardy
|
Mona Diab
Proceedings of the Nineteenth Conference on Computational Natural Language Learning
GWU-HASP-2015@QALB-2015 Shared Task: Priming Spelling Candidates with Probability
Mohammed Attia
|
Mohamed Al-Badrashiny
|
Mona Diab
Proceedings of the Second Workshop on Arabic Natural Language Processing
2014
Automatic Transliteration of Romanized Dialectal Arabic
Mohamed Al-Badrashiny
|
Ramy Eskander
|
Nizar Habash
|
Owen Rambow
Proceedings of the Eighteenth Conference on Computational Natural Language Learning
GWU-HASP: Hybrid Arabic Spelling and Punctuation Corrector
Mohammed Attia
|
Mohamed Al-Badrashiny
|
Mona Diab
Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP)
Foreign Words and the Automatic Processing of Arabic Social Media Text Written in Roman Script
Ramy Eskander
|
Mohamed Al-Badrashiny
|
Nizar Habash
|
Owen Rambow
Proceedings of the First Workshop on Computational Approaches to Code Switching
AIDA: Identifying Code Switching in Informal Arabic Text
Heba Elfardy
|
Mohamed Al-Badrashiny
|
Mona Diab
Proceedings of the First Workshop on Computational Approaches to Code Switching
Tharwa: A Large Scale Dialectal Arabic - Standard Arabic - English Lexicon
Mona Diab
|
Mohamed Al-Badrashiny
|
Maryam Aminian
|
Mohammed Attia
|
Heba Elfardy
|
Nizar Habash
|
Abdelati Hawwari
|
Wael Salloum
|
Pradeep Dasigi
|
Ramy Eskander
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
We introduce an electronic three-way lexicon, Tharwa, comprising Dialectal Arabic, Modern Standard Arabic and English correspondents. The paper focuses on Egyptian Arabic as the first pilot dialect for the resource, with plans to expand to other dialects of Arabic in later phases of the project. We describe Tharwa's creation process and report on its current status. The lexical entries are augmented with various elements of linguistic information such as POS, gender, rationality, number, and root and pattern information. The lexicon is based on a compilation of information from both monolingual and bilingual existing resources such as paper dictionaries and electronic, corpus-based dictionaries. Multiple levels of quality checks are performed on the output of each step in the creation process. The importance of this lexicon lies in the fact that it is the first resource of its kind bridging multiple variants of Arabic with English. Furthermore, it is a wide-coverage lexical resource containing over 73,000 Egyptian entries. Tharwa is publicly available. We believe it will have a significant impact on both Theoretical Linguistics and Computational Linguistics research.
MADAMIRA: A Fast, Comprehensive Tool for Morphological Analysis and Disambiguation of Arabic
Arfath Pasha
|
Mohamed Al-Badrashiny
|
Mona Diab
|
Ahmed El Kholy
|
Ramy Eskander
|
Nizar Habash
|
Manoj Pooleery
|
Owen Rambow
|
Ryan Roth
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
In this paper, we present MADAMIRA, a system for morphological analysis and disambiguation of Arabic that combines some of the best aspects of two previously commonly used systems for Arabic processing, MADA (Habash and Rambow, 2005; Habash et al., 2009; Habash et al., 2013) and AMIRA (Diab et al., 2007). MADAMIRA improves upon the two systems with a more streamlined Java implementation that is more robust, portable, extensible, and is faster than its ancestors by more than an order of magnitude. We also discuss an online demo (see http://nlp.ldeo.columbia.edu/madamira/) that highlights these aspects.
2008
A Compact Arabic Lexical Semantics Language Resource Based on the Theory of Semantic Fields
Mohamed Attia
|
Mohsen Rashwan
|
Ahmed Ragheb
|
Mohamed Al-Badrashiny
|
Husein Al-Basoumy
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
Applications of statistical Arabic NLP in general, and text mining in particular, along with the underlying tools, perform much better when the statistical processing operates on deeper language factorization(s) rather than on raw text. Lexical semantic factorization is very important in that respect due to its feasibility, high level of abstraction, and the language independence of its output. At the core of such a factorization lies an Arabic lexical semantic DB. While building this LR, we had to go beyond the conventional exclusive collection of words from dictionaries and thesauri, which alone cannot produce satisfactory coverage of this highly inflective and derivative language. This paper is hence devoted to the design and implementation of an Arabic lexical semantics LR that enables the retrieval of the possible senses of any given Arabic word at high coverage. Instead of tying full Arabic words to their possible senses, our LR flexibly relates morphologically and PoS-tag-constrained Arabic lexical compounds to a predefined limited set of semantic fields across which the standard semantic relations are defined. With the aid of the same large-scale Arabic morphological analyzer and PoS tagger at runtime, the possible senses of virtually any given Arabic word are retrievable.