Proceedings of the International Conference on the Challenge of Arabic for NLP/MT
Processing of Colloquial Arabic is a relatively new area of research, and a number of interesting challenges pertaining to spoken Arabic dialects arise. On the one hand, a whole continuum of Arabic dialects exists, with linguistic differences on phonological, morphological, syntactic, and lexical levels. On the other hand, there are inter-dialectal similarities that need be explored. Furthermore, due to scarcity of dialect-specific linguistic resources and availability of a wide range of resources for Modern Standard Arabic (MSA), it is desirable to explore the possibility of exploiting MSA tools when working on dialects. This paper describes challenges in processing of Colloquial Arabic in the context of language modeling for Automatic Speech Recognition. Using data from Egyptian Colloquial Arabic and MSA, we investigate the question of improving language modeling of Egyptian Arabic with MSA data and resources. As part of the project, we address the problem of linguistic variation between Egyptian Arabic and MSA. To account for differences between MSA and Colloquial Arabic, we experiment with the following techniques of data transformation: morphological simplification (stemming), lexical transductions, and syntactic transformations. While the best performing model remains the one built using only dialectal data, these techniques allow us to obtain an improvement over the baseline MSA model. More specifically, while the effect on perplexity of syntactic transformations is not very significant, stemming of the training and testing data improves the baseline perplexity of the MSA model trained on words by 51%, and lexical transductions yield an 82% perplexity reduction. Although the focus of the present work is on language modeling, we believe the findings of the study will be useful for researchers involved in other areas of processing Arabic dialects, such as parsing and machine translation.
Arabic WordNet is a lexical resource for Modern Standard Arabic based on the widely used Princeton WordNet for English (Fellbaum, 1998). Arabic WordNet (AWN) is based on the design and contents of the universally accepted Princeton WordNet (PWN) and will be mappable straightforwardly onto PWN 2.0 and EuroWordNet (EWN), enabling translation on the lexical level to English and dozens of other languages. We have developed and linked the AWN with the Suggested Upper Merged Ontology (SUMO), where concepts are defined with machine interpretable semantics in first order logic (Niles and Pease, 2001). We have greatly extended the ontology and its set of mappings to provide formal terms and definitions for each synset. The end product would be a linguistic resource with a deep formal semantic foundation that is able to capture the richness of Arabic as described in Elkateb (2005). Tools we have developed as part of this effort include a lexicographer's interface modeled on that used for EuroWordNet, with added facilities for Arabic script, following Black and Elkateb's earlier work (2004). In this paper we describe our methodology for building a lexical resource in Arabic and the challenge of Arabic for lexical resources.
In this paper, we report on several software implementations that we have developed within Prague Arabic Dependency Treebank or some other projects concerned with Arabic Natural Language Processing. We try to guide the reader through some essential tasks and note the solutions that we have designed and used. We as well point to third-party computational systems that the research community might exploit in the future work in this field.
Arabic diacritization (referred to sometimes as vocalization or vowelling), defined as the full or partial representation of short vowels, shadda (consonantal length or germination), tanween (nunation or definiteness), and hamza (the glottal stop and its support letters), is still largely understudied in the current NLP literature. In this paper, the lack of diacritics in standard Arabic texts is presented as a major challenge to most Arabic natural language processing tasks, including parsing. Recent studies (Messaoudi, et al. 2004; Vergyri & Kirchhoff 2004; Zitouni, et al. 2006 and Maamouri, et al. forthcoming) about the place and impact of diacritization in text-based NLP research are presented along with an analysis of the weight of the missing diacritics on Treebank morphological and syntactic analyses and the impact on parser development.
Morphological ambiguity is a major concern for syntactic parsers, POS taggers and other NLP tools. For example, the greater the number of morphological analyses given for a lexical entry, the longer a parser takes in analyzing a sentence, and the greater the number of parses it produces. Xerox Arabic Finite State Morphology and Buckwalter Arabic Morphological Analyzer are two of the best known, well documented, morphological analyzers for Modern Standard Arabic (MSA). Yet there are significant problems with both systems in design as well as coverage that increase the ambiguity rate. This paper shows how an ambiguity-controlled morphological analyzer for Arabic is built in a rule-based system that takes the stem as the base form using finite state technology. The paper also points out sources of legal and illegal ambiguities in MSA, and how ambiguity in the new system is reduced without compromising precision. At the end, an evaluation of Xerox, Buckwalter, and our system is conducted, and the performance is compared and analyzed.
Arabic has a very rich and complex morphology. Its appropriate morphological processing is very important for Information Retrieval (IR). In this paper, we propose a new stemming technique that tries to determine the stem of a word representing the semantic core of this word according to Arabic morphology. This method is compared to a commonly used light stemming technique which truncates a word by simple rules. Our tests on TREC collections show that the new stemming technique is more effective than the light stemming.
After providing a brief introduction to the transliteration problem, and highlighting some issues specific to Arabic to English translation, a three phase algorithm is introduced as a computational solution to the problem. The algorithm is based on a Hidden Markov Model approach, but also leverages information available in on-line databases. The algorithm is then evaluated, and shown to achieve accuracy approaching .80%
This article describes the construction of a lexicon and a morphological description for standard Arabic. This system uses finite state technology to parse vowelled texts, as well as partially and not vowelled ones. It is based on large-coverage morphological grammars covering all grammatical rules.
Cross-language information retrieval consists in providing a query in one language and searching documents in different languages. Retrieved documents are ordered by the probability of being relevant to the user's request with the highest ranked being considered the most relevant document. The LIC2M cross-language information retrieval system is a weighted Boolean search engine based on a deep linguistic analysis of the query and the documents to be indexed. This system, designed to work on Arabic, Chinese, English, French, German and Spanish, is composed of a multilingual linguistic analyzer, a statistical analyzer, a reformulator, a comparator and a search engine. The multilingual linguistic analyzer includes a morphological analyzer, a part-of-speech tagger and a syntactic analyzer. In the case of Arabic, a clitic stemmer is added to the morphological analyzer to segment the input words into proclitics, simple forms and enclitics. The linguistic analyzer processes both documents to be indexed and queries to produce a set of normalized lemmas, a set of named entities and a set of nominal compounds with their morpho-syntactic tags. The statistical analyzer computes for documents to be indexed concept weights based on concept database frequencies. The comparator computes intersections between queries and documents and provides a relevance weight for each intersection. Before this comparison, the reformulator expands queries during the search. The expansion is used to infer from the original query words other words expressing the same concepts. The expansion can be in the same language or in different languages. The search engine retrieves the ranked, relevant documents from the indexes according to the corresponding reformulated query and then merges the results obtained for each language, taking into account the original words of the query and their weights in order to score the documents. Sentence alignment consists in estimating which sentence or sentences in the source language correspond with which sentence or sentences in a target language. We present in this paper a new approach to aligning sentences from a parallel corpora based on the LIC2M cross-language information retrieval system. This approach consists in building a database of sentences of the target text and considering each sentence of the source text as a "query" to that database. The aligned bilingual parallel corpora can be used as a translation memory in a computer-aided translation tool.
This paper describes the construction of a dependency bank gold standard for Arabic, DCU 250 Arabic Dependency Bank (DCU 250), based on the Arabic Penn Treebank Corpus (ATB) (Bies and Maamouri, 2003; Maamouri and Bies, 2004) within the theoretical framework of Lexical Functional Grammar (LFG). For parsing and automatically extracting grammatical and lexical resources from treebanks, it is necessary to evaluate against established gold standard resources. Gold standards for various languages have been developed, but to our knowledge, such a resource has not yet been constructed for Arabic. The construction of the DCU 250 marks the first step towards the creation of an automatic LFG f-structure annotation algorithm for the ATB, and for the extraction of Arabic grammatical and lexical resources.
The paper describes the translations of three online systems: Google, Sakhr, and Systran, using two sets of texts (Arabic and English) as input. It diagnoses the faults and attempts to detect the reasons, trying to shed light on the areas where the right translation solution is missed. Flaws and translation problems are categorized and analyzed, and recommendations are given. The two modes of translation (from and into Arabic) face a wide range of common linguistic problems as well as mode-specific problems. These problems are discussed and examples of output are given. The paper raises questions whose answers should help in the improvement of MT systems. The questions deal with establishing equivalents, lexical environment, and collocation. Cases that triggered these questions are illustrated and discussed.
The interlingua approach to Machine Translation (MT) aims to achieve the translation task in two independent steps. First, the meanings of source language sentences are represented in an intermediate (interlingua) representation. Then, sentences of the target language are generated from those meaning representations. In the generation of the target sentence, determining sentence structures becomes more difficult, especially when the interlingua does not contain any syntactic information. Hence, the sentence structures cannot be transferred exactly from the interlingua representations. In this paper, we present a mapping approach for task- oriented interlingua-based spoken dialogue that transforms an interlingua representation, so-called Interchange Format (IF), into a feature structure (FS) that reflects the syntactic structure of the target Arabic sentence. This approach addresses the handling of the problem of Arabic syntactic structure determination in the interlingua approach. A mapper is developed primarily within the framework of the NESPOLE! (NEgotiating through SPOken Language in E-commerce) multilingual speech-to-speech MT project. The IF-to-Arabic FS mapper is implemented in SICStus Prolog. Examples of Arabic syntactic mapping, using the output from the English analyzer provided by Carnegie Mellon University (CMU), will illustrate how the system works.