Eugénio Ribeiro
2026
Portho: A Corpus-Based Resource of Orthographic Neighbors in European Portuguese
Eugénio Ribeiro | David Antunes | Nuno Mamede | Jorge Baptista
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1
Orthographic neighbors (ONs) play a central role in models of visual word recognition and have been shown to influence reading speed, lexical access, and literacy development. Despite their importance, resources providing detailed and flexible ON information remain scarce for European Portuguese. This paper introduces Portho, a corpus-based lexical resource that provides multiple ON metrics for over 43,000 word forms, using several ON definitions. In addition to classical neighborhood size measures, Portho provides frequency-based statistics and graded orthographic distance (OD) features. We analyze the statistical properties of the resource and evaluate its empirical utility in automatic text complexity assessment using the iRead4Skills corpus. Results show that while ON features alone are insufficient to predict readability, they contribute complementary information and compare favorably with existing resources for Portuguese. Portho is made publicly available in different formats to support research in psycholinguistics, readability modeling, and Natural Language Processing (NLP) for Portuguese.
From Complexity Scores to Readable Texts: iRead4Skills for Adult Literacy in Portuguese
Jorge Baptista | Eugénio Ribeiro | Nuno Mamede | David Antunes | Raquel Amaro
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 2
Adult Learning (AL) programmes need short, trustworthy texts that match learners’ reading abilities, but educators rarely have time, tools, or evidence-based guidelines to select and adapt materials consistently. We present a live demo of iRead4Skills for European Portuguese: a web-based system that (i) estimates readability/complexity for AL-oriented levels aligned with CEFR, (ii) highlights where complexity concentrates (lexical, grammatical, semantic), and (iii) supports rewriting by offering actionable, level-aware suggestions and curated lexical resources. The demo emphasises transparency and “trainer-first” workflows: users see *why* a text is complex and *how* to revise it without losing meaning.
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1
Marlo Souza | Iria de-Dios-Flores | Diana Santos | Larissa Freitas | Jackson Wilke da Cruz Souza | Eugénio Ribeiro
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1
Lexicon-Grammar Web
Jorge Baptista | David Antunes | Nuno Mamede | Eugénio Ribeiro
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 2
This demo showcases a web-based interface that provides open, interactive access to a large-scale grammatical database of European Portuguese verbal constructions. Through a unified search and exploration environment, users can query, inspect, and compare more than 7,000 distributionally free verbal constructions and over 2,700 verbal idioms (frozen constructions), grounded in long-standing Lexicon–Grammar descriptions. For each construction, the interface exposes core linguistic properties such as argument structure, distributional constraints, semantic roles, major syntactic transformations, and curated usage examples with English translations. The demo illustrates how detailed, manually validated grammatical knowledge can be explored dynamically via the web, supporting linguistic research, language teaching, and NLP development. To the best of our knowledge, this is the largest publicly accessible, web-based grammatical resource dedicated to European Portuguese verbal constructions.
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 2
Marlo Souza | Iria de-Dios-Flores | Diana Santos | Larissa Freitas | Jackson Wilke da Cruz Souza | Eugénio Ribeiro
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 2
2025
The iRead4Skills Intelligent Complexity Analyzer
Wafa Aissa | Raquel Amaro | David Antunes | Thibault Bañeras-Roux | Jorge Baptista | Alejandro Catala | Luís Correia | Thomas François | Marcos Garcia | Mario Izquierdo-Álvarez | Nuno Mamede | Vasco Martins | Miguel Neves | Eugénio Ribeiro | Sandra Rodriguez Rey | Elodie Vanzeveren
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
We present the iRead4Skills Intelligent Complexity Analyzer, an open-access platform specifically designed to assist educators and content developers in addressing the needs of low-literacy adults by analyzing and diagnosing text complexity. This multilingual system integrates a range of Natural Language Processing (NLP) components to assess input texts along multiple levels of granularity and linguistic dimensions in Portuguese, Spanish, and French. It assigns four tailored difficulty levels using state-of-the-art models, and introduces four diagnostic yardsticks—textual structure, lexicon, syntax, and semantics—offering users actionable feedback on specific dimensions of textual complexity. Each component of the system is supported by experiments comparing alternative models on manually annotated data.
UniversalCEFR: Enabling Open Multilingual Research on Language Proficiency Assessment
Joseph Marvin Imperial | Abdullah Barayan | Regina Stodden | Rodrigo Wilkens | Ricardo Muñoz Sánchez | Lingyun Gao | Melissa Torgbi | Dawn Knight | Gail Forey | Reka R. Jablonkai | Ekaterina Kochmar | Robert Joshua Reynolds | Eugénio Ribeiro | Horacio Saggion | Elena Volodina | Sowmya Vajjala | Thomas François | Fernando Alva-Manchego | Harish Tayyar Madabushi
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
We introduce UniversalCEFR, a large-scale multilingual multidimensional dataset of texts annotated according to the CEFR (Common European Framework of Reference) scale in 13 languages. To enable open research in both automated readability and language proficiency assessment, UniversalCEFR comprises 505,807 CEFR-labeled texts curated from educational and learner-oriented resources, standardized into a unified data format to support consistent processing, analysis, and modeling across tasks and languages. To demonstrate its utility, we conduct benchmark experiments using three modelling paradigms: a) linguistic feature-based classification, b) fine-tuning pre-trained LLMs, and c) descriptor-based prompting of instruction-tuned LLMs. Our results further support using linguistic features and fine-tuning pretrained models in multilingual CEFR level assessment. Overall, UniversalCEFR aims to establish best practices in data distribution in language proficiency research by standardising dataset formats and promoting their accessibility to the global research community.
2024
Text Readability Assessment in European Portuguese: A Comparison of Classification and Regression Approaches
Eugénio Ribeiro | Nuno Mamede | Jorge Baptista
Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 1
Exploring the Automated Scoring of Narrative Essays in Brazilian Portuguese using Transformer Models
Eugénio Ribeiro | Nuno Mamede | Jorge Baptista
Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 2
Automatic Text Readability Assessment in European Portuguese
Eugénio Ribeiro | Nuno Mamede | Jorge Baptista
Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 1
2020
Mapping the Dialog Act Annotations of the LEGO Corpus into ISO 24617-2 Communicative Functions
Eugénio Ribeiro | Ricardo Ribeiro | David Martins de Matos
Proceedings of the Twelfth Language Resources and Evaluation Conference
ISO 24617-2, the ISO standard for dialog act annotation, sets the ground for more comparable research in the area. However, the amount of data annotated according to it is still limited, which impairs the development of approaches for automatic recognition. In this paper, we describe a mapping of the original dialog act labels of the LEGO corpus, which have been neglected, into the communicative functions of the standard. Although this does not lead to a complete annotation according to the standard, the 347 dialogs provide a relevant amount of data that can be used in the development of automatic communicative function recognition approaches, which may lead to a wider adoption of the standard. Using the 17 English dialogs of the DialogBank as gold standard, our preliminary experiments have shown that including the mapped dialogs during the training phase leads to improved performance while recognizing communicative functions in the Task dimension.
2019
L2F/INESC-ID at SemEval-2019 Task 2: Unsupervised Lexical Semantic Frame Induction using Contextualized Word Representations
Eugénio Ribeiro | Vânia Mendonça | Ricardo Ribeiro | David Martins de Matos | Alberto Sardinha | Ana Lúcia Santos | Luísa Coheur
Proceedings of the 13th International Workshop on Semantic Evaluation
Building large datasets annotated with semantic information, such as FrameNet, is an expensive process. Consequently, such resources are unavailable for many languages and specific domains. This problem can be alleviated by using unsupervised approaches to induce the frames evoked by a collection of documents. That is the objective of the second task of SemEval 2019, which comprises three subtasks: clustering of verbs that evoke the same frame and clustering of arguments into both frame-specific slots and semantic roles. We approach all the subtasks by applying a graph clustering algorithm on contextualized embedding representations of the verbs and arguments. Using such representations is appropriate in the context of this task, since they provide cues for word-sense disambiguation. Thus, they can be used to identify different frames evoked by the same words. Using this approach we were able to outperform all of the baselines reported for the task on the test set in terms of Purity F1, as well as in terms of BCubed F1 in most cases.
2016
SPA: Web-based Platform for easy Access to Speech Processing Modules
Fernando Batista | Pedro Curto | Isabel Trancoso | Alberto Abad | Jaime Ferreira | Eugénio Ribeiro | Helena Moniz | David Martins de Matos | Ricardo Ribeiro
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
This paper presents SPA, a web-based Speech Analytics platform that integrates several speech processing modules and makes them usable through the web. It was developed with the aim of facilitating the usage of the modules, without the need to know about software dependencies and specific configurations. Apart from being accessible through a web browser, the platform also provides a REST API for easy integration with other applications. The platform is flexible, scalable, provides authentication for access restrictions, and was developed taking into consideration the time and effort of providing new services. The platform is still being improved, but it already integrates a considerable number of audio and text processing modules, including automatic transcription, speech disfluency classification, emotion detection, dialog act recognition, age and gender classification, non-nativeness detection, hyper-articulation detection, and two external modules for feature extraction and DTMF detection. This paper describes the SPA architecture, presents the already integrated modules, and provides a detailed description of the ones most recently integrated.
Co-authors
- Jorge Baptista 7
- Nuno Mamede 7
- David Antunes 4
- David Martins de Matos 3
- Ricardo Ribeiro 3
- Raquel Amaro 2
- Thomas François 2
- Larissa Freitas 2
- Diana Santos 2
- Marlo Souza 2
- Jackson Wilke da Cruz Souza 2
- Iria de-Dios-Flores 2
- Alberto Abad 1
- Wafa Aissa 1
- Fernando Alva-Manchego 1
- Abdullah Barayan 1
- Fernando Batista 1
- Thibault Bañeras-Roux 1
- Alejandro Catala 1
- Luísa Coheur 1
- Luís Correia 1
- Pedro Curto 1
- Jaime Ferreira 1
- Gail Forey 1
- Lingyun Gao 1
- Marcos Garcia 1
- Joseph Marvin Imperial 1
- Mario Izquierdo-Álvarez 1
- Reka R. Jablonkai 1
- Dawn Knight 1
- Ekaterina Kochmar 1
- Vasco Martins 1
- Vânia Mendonça 1
- Helena Moniz 1
- Ricardo Muñoz Sánchez 1
- Miguel Neves 1
- Sandra Rodriguez Rey 1
- Robert Joshua Reynolds 1
- Horacio Saggion 1
- Ana Lúcia Santos 1
- Alberto Sardinha 1
- Regina Stodden 1
- Harish Tayyar Madabushi 1
- Melissa Torgbi 1
- Isabel Trancoso 1
- Sowmya Vajjala 1
- Elodie Vanzeveren 1
- Elena Volodina 1
- Rodrigo Wilkens 1