David Antunes
2025
The iRead4Skills Intelligent Complexity Analyzer
Wafa Aissa | Raquel Amaro | David Antunes | Thibault Bañeras-Roux | Jorge Baptista | Alejandro Catala | Luís Correia | Thomas François | Marcos Garcia | Mario Izquierdo-Álvarez | Nuno Mamede | Vasco Martins | Miguel Neves | Eugénio Ribeiro | Sandra Rodriguez Rey | Elodie Vanzeveren
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
We present the iRead4Skills Intelligent Complexity Analyzer, an open-access platform specifically designed to assist educators and content developers in addressing the needs of low-literacy adults by analyzing and diagnosing text complexity. This multilingual system integrates a range of Natural Language Processing (NLP) components to assess input texts along multiple levels of granularity and linguistic dimensions in Portuguese, Spanish, and French. It assigns four tailored difficulty levels using state-of-the-art models, and introduces four diagnostic yardsticks—textual structure, lexicon, syntax, and semantics—offering users actionable feedback on specific dimensions of textual complexity. Each component of the system is supported by experiments comparing alternative models on manually annotated data.
A European Portuguese corpus annotated for verbal idioms
David Antunes | Jorge Baptista | Nuno J. Mamede
Proceedings of the 21st Workshop on Multiword Expressions (MWE 2025)
This paper presents the construction of VIDiom-PT, a corpus in European Portuguese annotated for verbal idioms (e.g. O Rui bateu a bota, lit.: Rui hit the boot ‘Rui died’). This linguistic resource aims to support the development of systems capable of processing such constructions in this language variety. To assist in the annotation effort, two tools were built. The first detects possible instances of verbal idioms in texts, while the second provides a graphical interface for annotating them. In total, 5,178 instances of 747 different verbal idioms were annotated in more than 200,000 sentences in European Portuguese. Inter-annotator agreement was high: Krippendorff’s alpha for nominal data reached 0.869 on the 5% of the data that 3 experts annotated independently. Part of the annotated corpus is also made publicly available.
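For readers unfamiliar with the agreement measure mentioned in the abstract, the following is a minimal sketch of Krippendorff's alpha for nominal data. It is not the authors' implementation: the function name, the input layout (one list of labels per annotated item, with `None` for a missing annotation), and the `"idiom"`/`"literal"` labels in the usage lines are assumptions made for illustration only.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of lists: one inner list per annotated item, holding
    the label given by each annotator (None marks a missing annotation).
    This layout is an assumption for the sketch.
    """
    # Coincidence matrix: every ordered pair of labels given to the same
    # item by two different annotators, weighted by 1 / (m - 1), where m
    # is the number of labels the item actually received.
    coincidences = Counter()
    for ratings in units:
        values = [r for r in ratings if r is not None]
        m = len(values)
        if m < 2:
            continue  # items with fewer than two labels carry no pairing information
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and overall number of pairable labels.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item with two or more labels")

    # Nominal distance: 0 for identical labels, 1 otherwise.
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is ever used")
    return 1.0 - observed / expected


# Hypothetical toy data: four items, two annotators each.
toy = [
    ["idiom", "idiom"],
    ["idiom", "literal"],
    ["literal", "literal"],
    ["literal", "literal"],
]
alpha = krippendorff_alpha_nominal(toy)
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance, which gives some context for the 0.869 reported for the corpus.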