2024
The SAMER Arabic Text Simplification Corpus
Bashar Alhafni | Reem Hazim | Juan David Pineros Liberato | Muhamed Al Khalil | Nizar Habash
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
We present the SAMER Corpus, the first manually annotated Arabic parallel corpus for text simplification targeting school-aged learners. Our corpus comprises texts of 159K words selected from 15 publicly available Arabic fiction novels, most of which were published between 1865 and 1955. Our corpus includes readability level annotations at both the document and word levels, as well as two simplified parallel versions for each text targeting learners at two different readability levels. We describe the corpus selection process, and outline the guidelines we followed to create the annotations and ensure their quality. Our corpus is publicly available to support and encourage research on Arabic text simplification, Arabic automatic readability assessment, and the development of Arabic pedagogical language technologies.
2022
Arabic Word-level Readability Visualization for Assisted Text Simplification
Reem Hazim | Hind Saddiki | Bashar Alhafni | Muhamed Al Khalil | Nizar Habash
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
This demo paper presents a Google Docs add-on for automatic Arabic word-level readability visualization. The add-on includes a lemmatization component that is connected to a five-level readability lexicon and Arabic WordNet-based substitution suggestions. The add-on can be used for assessing the reading difficulty of a text and identifying difficult words as part of the task of manual text simplification. We make our add-on and its code publicly available.
2020
A Large-Scale Leveled Readability Lexicon for Standard Arabic
Muhamed Al Khalil | Nizar Habash | Zhengyang Jiang
Proceedings of the Twelfth Language Resources and Evaluation Conference
We present a large-scale 26,000-lemma leveled readability lexicon for Modern Standard Arabic. The lexicon was manually annotated in triplicate by language professionals from three regions in the Arab world. The annotations show a high degree of agreement, and major differences were limited to regional variations. Comparing lemma readability levels with their frequencies provided good insights into the benefits and pitfalls of frequency-based readability approaches. The lexicon will be publicly available.
An Online Readability Leveled Arabic Thesaurus
Zhengyang Jiang | Nizar Habash | Muhamed Al Khalil
Proceedings of the 28th International Conference on Computational Linguistics: System Demonstrations
This demo paper introduces the online Readability Leveled Arabic Thesaurus interface. For a given user input word, this interface provides the word’s possible lemmas, roots, English glosses, related Arabic words and phrases, and readability on a five-level readability scale. This interface builds on and connects multiple existing Arabic resources and processing tools. This one-of-a-kind system enables Arabic speakers and learners to benefit from advances in Arabic computational linguistics technologies. Feedback from users of the system will help the developers to identify lexical coverage gaps and errors. A live link to the demo is available at: http://samer.camel-lab.com/.
2018
Feature Optimization for Predicting Readability of Arabic L1 and L2
Hind Saddiki | Nizar Habash | Violetta Cavalli-Sforza | Muhamed Al Khalil
Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications
Advances in automatic readability assessment can impact the way people consume information in a number of domains. Arabic, being a low-resource and morphologically complex language, presents numerous challenges to the task of automatic readability assessment. In this paper, we present the largest and most in-depth computational readability study for Arabic to date. We study a large set of features with varying depths, from shallow words to syntactic trees, for both L1 and L2 readability tasks. Our best L1 readability accuracy result is 94.8% (75% error reduction from a commonly used baseline). The comparable results for L2 are 72.4% (45% error reduction). We also demonstrate the added value of leveraging L1 features for L2 readability prediction.
A Leveled Reading Corpus of Modern Standard Arabic
Muhamed Al Khalil | Hind Saddiki | Nizar Habash | Latifa Alfalasi
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2016
Exploiting Arabic Diacritization for High Quality Automatic Annotation
Nizar Habash | Anas Shahrour | Muhamed Al-Khalil
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
We present a novel technique for Arabic morphological annotation. The technique utilizes diacritization to produce morphological annotations of quality comparable to human annotators. Although Arabic text is generally written without diacritics, diacritization is already available for large corpora of Arabic text in several genres. Furthermore, diacritization can be generated at a low cost for new text as it does not require specialized training beyond what educated Arabic typists know. The basic approach is to enrich the input to a state-of-the-art Arabic morphological analyzer with word diacritics (full or partial) to enhance its performance. When applied to fully diacritized text, our approach produces annotations with an accuracy of over 97% on lemma, part-of-speech, and tokenization combined.