Behrang Mohit

2016

Building an Arabic Machine Translation Post-Edited Corpus: Guidelines and Annotation
Wajdi Zaghouani | Nizar Habash | Ossama Obeid | Behrang Mohit | Houda Bouamor | Kemal Oflazer
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present our guidelines and annotation procedure for creating a human-corrected, post-edited machine translation corpus for Modern Standard Arabic. Our overarching goal is to use the annotated corpus to develop automatic machine translation post-editing systems for Arabic that can help accelerate the human revision of translated texts. Creating any manually annotated corpus usually presents many challenges. To address these challenges, we created comprehensive and simplified annotation guidelines, which were used by a team of five annotators and one lead annotator. To ensure high agreement between the annotators, multiple training sessions were held and inter-annotator agreement was measured regularly to check annotation quality. The resulting corpus of manually post-edited translations of English-to-Arabic articles is the largest to date for this language pair.

2015

2014

The First QALB Shared Task on Automatic Text Correction for Arabic
Behrang Mohit | Alla Rozovskaya | Nizar Habash | Wajdi Zaghouani | Ossama Obeid
Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP)

CMUQ-Hybrid: Sentiment Classification By Feature Engineering and Parameter Tuning
Kamla Al-Mannai | Hanan Alshikhabobakr | Sabih Bin Wasi | Rukhsar Neyaz | Houda Bouamor | Behrang Mohit
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

CMUQ@Qatar: Using Rich Lexical Features for Sentiment Analysis on Twitter
Sabih Bin Wasi | Rukhsar Neyaz | Houda Bouamor | Behrang Mohit
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

YouDACC: the Youtube Dialectal Arabic Comment Corpus
Ahmed Salama | Houda Bouamor | Behrang Mohit | Kemal Oflazer
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents YOUDACC, an automatically annotated, large-scale, multi-dialectal Arabic corpus collected from user comments on YouTube videos. Our corpus covers five groups of dialects: Egyptian (EG), Gulf (GU), Iraqi (IQ), Maghrebi (MG) and Levantine (LV). We perform an empirical analysis of the crawled corpus and demonstrate that our proposed location-based method is effective for the task of dialect labeling.

We present annotation guidelines and a web-based annotation framework developed as part of an effort to create a manually annotated Arabic corpus of errors and corrections for various text types. Such a corpus will be invaluable for developing Arabic error correction tools, both for training models and as a gold standard for evaluating error correction algorithms. We summarize the guidelines we created. We also describe issues encountered during the training of the annotators, as well as problems that are specific to the Arabic language that arose during the annotation process. Finally, we present the annotation tool that was developed as part of this project, the annotation pipeline, and the quality of the resulting annotations.

A Human Judgement Corpus and a Metric for Arabic MT Evaluation
Houda Bouamor | Hanan Alshikhabobakr | Behrang Mohit | Kemal Oflazer
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

2013

SuMT: A Framework of Summarization and MT
Houda Bouamor | Behrang Mohit | Kemal Oflazer
Proceedings of the Sixth International Joint Conference on Natural Language Processing

Dudley North visits North London: Learning When to Transliterate to Arabic
Mahmoud Azab | Houda Bouamor | Behrang Mohit | Kemal Oflazer
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Supersense Tagging for Arabic: the MT-in-the-Middle Attack
Nathan Schneider | Behrang Mohit | Chris Dyer | Kemal Oflazer | Noah A. Smith
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2012

Annotating and Learning Morphological Segmentation of Egyptian Colloquial Arabic
Emad Mohamed | Behrang Mohit | Kemal Oflazer
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present an annotation and morphological segmentation scheme for Egyptian Colloquial Arabic (ECA) in which we annotate user-generated content that deviates significantly from the orthographic and grammatical rules of Modern Standard Arabic (MSA) and thus cannot be processed by the commonly used MSA tools. A per-letter classification scheme, in which each letter is classified as either a segment boundary or not, combined with a memory-based classifier that uses only word-internal context, proves effective and achieves 92% exact-match accuracy at the word level. The well-known MADA system achieves 81%, while the per-letter classification scheme using the ATB achieves 82%. Error analysis shows that the major problem is character ambiguity: ECA orthography overloads characters that would otherwise be more specific in MSA. For example, the distinctions between y (ي) and Y (ى), and between A (ا), > (أ), and < (إ), are collapsed to y (ي) and A (ا) respectively, or the characters are confused and used interchangeably. While normalization helps alleviate orthographic inconsistencies, it aggravates the problem of ambiguity.

Recall-Oriented Learning of Named Entities in Arabic Wikipedia
Behrang Mohit | Nathan Schneider | Rishav Bhowmick | Kemal Oflazer | Noah A. Smith
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

Transforming Standard Arabic to Colloquial Arabic
Emad Mohamed | Behrang Mohit | Kemal Oflazer
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Coarse Lexical Semantic Annotation with Supersenses: An Arabic Case Study
Nathan Schneider | Behrang Mohit | Kemal Oflazer | Noah A. Smith
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2010

Using Variable Decoding Weight for Language Model in Statistical Machine Translation
Behrang Mohit | Rebecca Hwa | Alon Lavie
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers

This paper investigates varying the decoder weight of the language model (LM) when translating different parts of a sentence. We determine the conditions under which the LM weight should be adapted. We find that a better translation can be achieved by varying the LM weight when decoding the most problematic spot in a sentence, which we refer to as a difficult segment. Two adaptation strategies are proposed and compared through experiments. We find that adapting a different LM weight for every difficult segment results in the largest improvement in translation quality.

Improving Phrase-Based Translation with Prototypes of Short Phrases
Frank Liberato | Behrang Mohit | Rebecca Hwa
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics