Jesús Calleja

2026

IREKIER: An Easy Read Corpus for Basque and Spanish
Jesús Calleja | Thierry Etchegoyhen
Proceedings of the Fifteenth Language Resources and Evaluation Conference

Easy Read (ER) text adaptation is one of the main means to provide accessible content for people with reading difficulties. ER text features aspects of text simplification, along with specific characteristics such as the need for short sentences, clearly structured content, and explanations for complex concepts. Support for ER text generation is still lacking overall, with few available resources to build automated systems upon. In this work, we describe the IREKIER corpus, based on ER news in Basque and Spanish from the Irekia transparency portal of the Basque Government. This corpus is currently one of the largest publicly shared resources to support training and evaluation of ER text adaptation models in these two languages, and the first of its kind for Basque. We describe our methodology to create the resource, along with the specific challenges raised by ER text. We also provide both intrinsic and extrinsic evaluations of the corpus, which is shared with the scientific community under a CC-BY-NC-ND 4.0 license.

2024

Split and Rephrase with Large Language Models
David Ponce | Thierry Etchegoyhen | Jesús Calleja | Harritxu Gete
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The Split and Rephrase (SPRP) task, which consists in splitting complex sentences into a sequence of shorter grammatical sentences, while preserving the original meaning, can facilitate the processing of complex texts for humans and machines alike. It is also a valuable testbed to evaluate natural language processing models, as it requires modelling complex grammatical aspects. In this work, we evaluate large language models on the task, showing that they can provide large improvements over the state of the art on the main metrics, although still lagging in terms of splitting compliance. Results from two human evaluations further support the conclusions drawn from automated metric results. We provide a comprehensive study that includes prompting variants, domain shift, fine-tuned pretrained language models of varying parameter size and training data volumes, contrasted with both zero-shot and few-shot approaches on instruction-tuned language models. Although the latter were markedly outperformed by fine-tuned models, they may constitute a reasonable off-the-shelf alternative. Our results provide a fine-grained analysis of the potential and limitations of large language models for SPRP, with significant improvements achievable using relatively small amounts of training data and model parameters overall, and remaining limitations for all models on the task.

Automating Easy Read Text Segmentation
Jesús Calleja | Thierry Etchegoyhen | David Ponce
Findings of the Association for Computational Linguistics: EMNLP 2024

Easy Read text is one of the main forms of access to information for people with reading difficulties. One of the key characteristics of this type of text is the requirement to split sentences into smaller grammatical segments, to facilitate reading. Automated segmentation methods could foster the creation of Easy Read content, but their viability has yet to be addressed. In this work, we study novel methods for the task, leveraging masked and generative language models, along with constituent parsing. We conduct comprehensive automatic and human evaluations in three languages, analysing the strengths and weaknesses of the proposed alternatives, under scarce resource limitations. Our results highlight the viability of automated Easy Read segmentation and remaining deficiencies compared to expert-driven human segmentation.
