Nouran Khallaf

2026

A Multilingual Human Annotated Corpus of Original and Easy-to-Read Texts to Support Access to Democratic Participatory Processes
Stefan Bott | Verena Riegler | Horacio Saggion | Almudena Rascón Alcaina | Nouran Khallaf
Proceedings of the Fifteenth Language Resources and Evaluation Conference

Being able to understand information is a key factor for a self-determined life and society. The study of automatic text simplification is often limited by the availability of high-quality material for the training and evaluation of automatic simplifiers. This is true for English, but more so for less-resourced languages like Spanish, Catalan and Italian. In order to fill this gap, we present a corpus of original texts with high-quality simplification produced by human experts in text simplification. It was developed within a project to assess the impact of Easy-to-Read (E2R) language for democratic participation. The original texts were compiled from domains related to this topic. The corpus includes different text types, selected based on relevance, copyright availability, and ethical standards. All texts were simplified to Easy-to-Read level. The corpora hold significant scientific value, particularly as they include the first annotated corpus of its kind for the Catalan language. They also represent a noteworthy contribution for Spanish and Italian, offering high-quality, human-annotated language resources that are rarely available in these domains. The corpora will be made freely accessible to the public.

How Much Noise Can BERT Handle? Insights from Multilingual Sentence Difficulty Detection
Nouran Khallaf | Serge Sharoff
Proceedings of the Fifteenth Language Resources and Evaluation Conference

Noisy training data can significantly degrade the performance of language-model-based classifiers, particularly in non-topical classification tasks. This study explores a range of denoising strategies for sentence-level difficulty detection, using training data derived from document-level difficulty annotations obtained through noisy crowdsourcing. Beyond monolingual settings, we also address cross-lingual transfer, where a multilingual language model is trained in one language and tested in another. We evaluate several noise reduction techniques, including Gaussian Mixture Models (GMM), Co-Teaching, Noise Transition Matrices, and Label Smoothing. Our results indicate that while BERT-based models exhibit inherent robustness to noise, incorporating explicit noise detection can further enhance performance. For our smaller dataset, GMM-based noise filtering proves particularly effective in improving prediction quality by raising the AUC score from 0.52 to 0.86, or to 0.92 when two denoising methods are combined (GMM and Co-Teaching). However, for our larger dataset, the intrinsic regularisation of pre-trained language models provides a strong baseline, with denoising methods yielding only marginal gains (from 0.8948 to 0.8984, or to 0.9061 when two denoising methods are combined). Nonetheless, removing noisy sentences (about 20% of the dataset) helps in producing a cleaner corpus with fewer infelicities. As a result, we have released the largest available multilingual corpus for sentence difficulty prediction.

To Predict or Not to Predict? Towards Reliable Uncertainty Estimation in the Presence of Noise
Nouran Khallaf | Serge Sharoff
Proceedings of the Fifteenth Language Resources and Evaluation Conference

This study examines the role of uncertainty estimation (UE) methods in multilingual text classification under noisy and non-topical conditions. Using a complex-vs-simple sentence classification task across several languages, we evaluate a range of UE techniques against a range of metrics to assess their quality. Results indicate that while methods relying on softmax outputs remain competitive in high-resource in-domain settings, their reliability declines in low-resource or domain-shift scenarios. In contrast, Monte Carlo dropout approaches demonstrate consistently strong performance across all languages, offering more robust calibration, stable decision thresholds, and greater discriminative power even under adverse conditions. We further demonstrate the positive impact of UE on non-topical classification: selectively abstaining from predicting the 10% most uncertain instances increases the macro F1 score from 0.81 to 0.85 in the Readme task. By integrating UE with trustworthiness metrics, this study provides actionable insights for developing more reliable NLP systems in real-world multilingual environments.

Align and Shine: Building High-quality Sentence-aligned Corpora for Multilingual Text Simplification
Luis Kenji Hilasaca Sanchez | Nouran Khallaf | Serge Sharoff
Proceedings of the 19th Workshop on Building and Using Comparable Corpora (BUCC)

Text simplification plays a crucial role in improving the accessibility and comprehensibility of written information for diverse audiences, including language learners and readers with limited literacy. Despite its importance, large-scale, high-quality datasets for training and evaluating text simplification models remain scarce for languages other than English. This paper reports an experimental study on the collection and processing of crowd-sourced simplification data to construct a corpus suitable for both training and testing text simplification systems across multiple languages (Catalan, English, French, Italian and Spanish). We report mechanisms for sentence-level alignment from document-level data. The resulting dataset of the aligned sentence pairs is publicly available.

UOL@IDEM at BEA 2026 Shared Task 1: Neural Fusion and Feature-Rich Modeling for L1-Aware Vocabulary Difficulty Prediction
Nouran Khallaf | Serge Sharoff
Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)

This paper describes UOL@IDEM’s closed-track submission to the BEA 2026 shared task on L1-aware vocabulary difficulty prediction. We model the task as regression and train separate systems for Spanish, German, and Mandarin Chinese. Our system combines multilingual contextual representations with engineered features capturing frequency, surface form, retrieval evidence, semantic alignment, cognate similarity, and masked-language-model predictability. Development results show consistent gains over the official closed-track baselines, with sentence-embedding encoders such as BGE-M3, multilingual E5, and LaBSE performing best. Official submissions achieve RMSE scores of 1.132, 1.037, and 0.891 for Spanish, German, and Chinese, respectively. Feature analysis identifies frequency as the most stable predictor, while contextual predictability, form similarity, retrieval, and semantic features provide complementary L1-sensitive signals. Error analysis shows strong ranking performance but weaker calibration for the easiest items, which are often overpredicted.

2025

Reading Between the Lines: A dataset and a study on why some texts are tougher than others
Nouran Khallaf | Carlo Eugeni | Serge Sharoff
Proceedings of the First Workshop on Writing Aids at the Crossroads of AI, Cognitive Science and NLP (WRAICOGS 2025)

Our research aims at better understanding what makes a text difficult to read for specific audiences with intellectual disabilities, more specifically, people who have limitations in cognitive functioning, such as reading and understanding skills, an IQ below 70, and challenges in conceptual domains. We introduce a scheme for the annotation of difficulties which is based on empirical research in psychology as well as on research in translation studies. The paper describes the annotated dataset, primarily derived from the parallel texts (standard English and Easy to Read English translations) made available online. We fine-tuned four different pre-trained transformer models to perform the task of multiclass classification to predict the strategies required for simplification. We also investigate the possibility of interpreting the decisions of this language model when it is aimed at predicting the difficulty of sentences in this dataset.

UoL-UPF at TSAR 2025 Shared Task: A Generate-and-Select Approach for Readability-Controlled Text Simplification
Akio Hayakawa | Nouran Khallaf | Horacio Saggion | Serge Sharoff
Proceedings of the Fourth Workshop on Text Simplification, Accessibility and Readability (TSAR 2025)

The TSAR 2025 Shared Task on Readability-Controlled Text Simplification focuses on simplifying English paragraphs written at an advanced level (B2 or higher) and rewriting them to target CEFR levels (A2 or B1). The challenge is to reduce linguistic complexity without sacrificing coherence or meaning. We developed three complementary approaches based on large language models (LLMs). The first approach (Run 1) generates a diverse set of paragraph-level simplifications. It then applies filters to enforce CEFR alignment, preserve meaning, and encourage diversity, and finally selects the candidates with the lowest perceived risk. The second (Run 2) performs simplification at the sentence level, combining structured prompting, coreference resolution, and explainable AI techniques to highlight influential phrases, with candidate selection guided by automatic and LLM-based judges. The third hybrid approach (Run 3) integrates both strategies by pooling paragraph- and sentence-level simplifications, and subsequently applying the identical filtering and selection architecture used in Run 1. In the official TSAR evaluation, the hybrid system ranked 2nd overall, while its component systems also achieved competitive results.

Proceedings of the Fourth Workshop on Text Simplification, Accessibility and Readability (TSAR 2025)
Matthew Shardlow | Fernando Alva-Manchego | Kai North | Regina Stodden | Horacio Saggion | Nouran Khallaf | Akio Hayakawa
Proceedings of the Fourth Workshop on Text Simplification, Accessibility and Readability (TSAR 2025)

FreeTxt: Analyse and Visualise Multilingual Qualitative Survey Data for Cultural Heritage Sites
Nouran Khallaf | Ignatius Ezeani | Dawn Knight | Paul Rayson | Mo El-Haj | John Vidler | James Davies | Fernando Alva-Manchego
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

We introduce FreeTxt, a free and open-source web-based tool designed to support the analysis and visualisation of multilingual qualitative survey data, with a focus on low-resource languages. Developed in collaboration with stakeholders, FreeTxt integrates established techniques from corpus linguistics with modern natural language processing methods in an intuitive interface accessible to non-specialists. The tool currently supports bilingual processing and visualisation of English and Welsh responses, with ongoing extensions to other languages such as Vietnamese. Key functionalities include semantic tagging via PyMUSAS, multilingual sentiment analysis, keyword and collocation visualisation, and extractive summarisation. User evaluations with cultural heritage institutions demonstrate the system’s utility and potential for broader impact.

Democracy Made Easy: Simplifying Complex Topics to Enable Democratic Participation
Nouran Khallaf | Stefan Bott | Carlo Eugeni | John O’Flaherty | Serge Sharoff | Horacio Saggion
Proceedings of the 1st Workshop on Artificial Intelligence and Easy and Plain Language in Institutional Contexts (AI & EL/PL)

Several people are excluded from democratic deliberation because the language which is used in this context may be too difficult for them to understand. Our iDEM project aims at lowering existing linguistic barriers in deliberative processes by developing technology to facilitate the translation of complicated text into easy-to-read formats which are more suitable for many people. In this paper we describe classification experiments for detecting different types of difficulties which should be amended in order to make texts easier to understand. We focus on a lexical simplification system which can achieve state-of-the-art results with the use of a free and open-weight Large Language Model for the Romance Languages in the iDEM project. Moreover, a sentence segmentation system is introduced that can create text segmentation for long sentences based on training data. We describe the iDEM mobile app, which will make our technology available as a service for end-users of our target populations.

This paper presents an attempt to build a Modern Standard Arabic (MSA) sentence-level simplification system. We experimented with sentence simplification using two approaches: (i) a classification approach leading to lexical simplification pipelines which use Arabic-BERT, a pre-trained contextualised model, as well as a model of fastText word embeddings; and (ii) a generative approach, a Seq2Seq technique applying the multilingual Text-to-Text Transfer Transformer (mT5). We developed our training corpus by aligning the original and simplified sentences from the internationally acclaimed Arabic novel Saaq al-Bambuu. We evaluate the effectiveness of these methods by comparing the generated simple sentences to the target simple sentences using the BERTScore evaluation metric. The simple sentences produced by the mT5 model achieve P 0.72, R 0.68 and F-1 0.70 via BERTScore, while combining Arabic-BERT and fastText achieves P 0.97, R 0.97 and F-1 0.97. In addition, we report a manual error analysis for these experiments.

AraSAS: The Open Source Arabic Semantic Tagger
Mahmoud El-Haj | Elvis de Souza | Nouran Khallaf | Paul Rayson | Nizar Habash
Proceedings of the 5th Workshop on Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur'an QA and Fine-Grained Hate Speech Detection

This paper presents AraSAS, the first open-source Arabic semantic analysis tagging system. AraSAS is a software framework that provides full semantic tagging of text written in Arabic. AraSAS is based on the UCREL Semantic Analysis System (USAS), which was first developed to semantically tag English text. Similarly to USAS, AraSAS uses a hierarchical semantic tag set that contains 21 major discourse fields and 232 fine-grained semantic field tags. The paper describes the creation, validation and evaluation of AraSAS. In addition, we demonstrate a first case study to illustrate the affordances of applying USAS and AraSAS semantic taggers on the Zayed University Arabic-English Bilingual Undergraduate Corpus (ZAEBUC) (Palfreyman and Habash, 2022), where we show and compare the coverage of the two semantic taggers by running them on Arabic and English essays on different topics. The analysis expands to compare the taggers when run on texts in Arabic and English written by the same writer and texts written by male and by female students. Variables for comparison include frequency of use of particular semantic sub-domains, as well as the diversity of semantic elements within a text.

2021

Automatic Difficulty Classification of Arabic Sentences
Nouran Khallaf | Serge Sharoff
Proceedings of the Sixth Arabic Natural Language Processing Workshop

In this paper, we present a Modern Standard Arabic (MSA) sentence difficulty classifier, which predicts the difficulty of sentences for language learners using either the CEFR proficiency levels or the binary classification as simple or complex. We compare the use of sentence embeddings of different kinds (fastText, mBERT, XLM-R and Arabic-BERT), as well as traditional language features such as POS tags, dependency trees, readability scores and frequency lists for language learners. Our best results have been achieved using fine-tuned Arabic-BERT. The accuracy of our 3-way CEFR classification is F-1 of 0.80 and 0.75 for Arabic-BERT and XLM-R classification respectively and 0.71 Spearman correlation for regression. Our binary difficulty classifier reaches F-1 0.94, and our sentence-pair semantic similarity classifier reaches F-1 0.98.
