Proceedings of the Fourth Workshop on Text Simplification, Accessibility and Readability (TSAR 2025)

Matthew Shardlow, Fernando Alva-Manchego, Kai North, Regina Stodden, Horacio Saggion, Nouran Khallaf, Akio Hayakawa (Editors)

Anthology ID: 2025.tsar-1
Month: November
Year: 2025
Address: Suzhou, China
Venues: TSAR | WS
Publisher: Association for Computational Linguistics
URL: https://preview.aclanthology.org/ingest-emnlp/2025.tsar-1/
ISBN: 979-8-89176-176-6
PDF: https://preview.aclanthology.org/ingest-emnlp/2025.tsar-1.pdf


Template-Based Text-to-Image Alignment for Language Accessibility: A Study on Visualizing Text Simplifications
Belkiss Souayed | Sarah Ebling | Yingqiang Gao

Individuals with intellectual disabilities often have difficulties in comprehending complex texts. While many text-to-image models prioritize photorealism over cognitive accessibility, it is not clear how visual illustrations relate to text simplifications (TS) generated from them. This paper presents a structured vision language model (VLM) prompting framework for generating cognitively accessible images from simplified texts. We designed five prompt templates, i.e., Basic Object Focus, Contextual Scene, Educational Layout, Multi-Level Detail, and Grid Layout, each following distinct spatial arrangements while adhering to accessibility constraints such as object count limits, spatial separation, and content restrictions. Using 400 sentence-level TS pairs from four established text simplification datasets (OneStopEnglish, SimPA, Wikipedia, ASSET), we conducted a two-phase evaluation: Phase 1 assessed template effectiveness with CLIP similarity scores, and Phase 2 involved expert annotation of generated images across ten visual styles by four accessibility specialists. Results show that the Basic Object Focus template achieved the highest semantic alignment, indicating that visual minimalism enhances accessibility. Expert evaluation further identified Retro style as the most accessible and Wikipedia as the most effective text source. Inter-annotator agreement varied across dimensions, with Text Simplicity showing strong reliability and Image Quality proving more subjective. Overall, our framework offers practical guidelines for accessible content creation and underscores the importance of structured prompting in AI-generated visual accessibility tools.

Document-level Simplification and Illustration Generation: Multimodal Coherence
Yuhang Liu | Mo Zhang | Zhaoyi Cheng | Sarah Ebling

We present a novel method for document-level text simplification and automatic illustration generation aimed at enhancing information accessibility for individuals with cognitive impairments. While prior research has primarily focused on sentence- or paragraph-level simplification and text-to-image generation for narrative contexts, this work addresses the unique challenges of simplifying long-form documents and generating semantically aligned visuals. The pipeline consists of three stages: (1) discourse-aware segmentation using large language models, (2) visually grounded description generation via abstraction, and (3) controlled image synthesis using state-of-the-art diffusion models, including DALLE 3 and FLUX1-dev. We further incorporate stylistic constraints to ensure visual coherence, and we conduct a human evaluation measuring comprehension, semantic alignment, and visual clarity. Experimental results demonstrate that our method effectively combines simplified text and visual content, with generated illustrations enhancing textual accessibility.

Medical Text Simplification: From Jargon Detection to Jargon-Aware Prompting
Taiki Papandreou | Jan Bakker | Jaap Kamps

Jargon identification is critical for improving the accessibility of biomedical texts, yet models are often evaluated on isolated datasets, leaving open questions about generalization. After reproducing MedReadMe's jargon detection results and extending evaluation to the PLABA dataset, we find that transfer learning across datasets yields only modest gains, largely due to divergent annotation objectives. Through manual re-annotation, we show that aligning labeling schemes improves cross-dataset performance. Building on these findings, we evaluate several jargon-aware prompting strategies for LLM-based medical text simplification. Explicitly highlighting jargon in prompts does not consistently improve simplification quality. When gains occur, they often trade off against readability and are model-dependent. Human evaluation indicates that simple prompting can be as effective as more complex jargon-aware instructions. We release code to facilitate further research: https://anonymous.4open.science/r/tsar-anonymous-2D66F/README.md

Readability Reconsidered: A Cross-Dataset Analysis of Reference-Free Metrics
Catarina Belem | Parker Glenn | Alfy Samuel | Anoop Kumar | Daben Liu

Automatic readability assessment plays a key role in ensuring effective communication between humans and language models. Despite significant progress, the field is hindered by inconsistent definitions of readability and measurements that rely on surface-level text properties. In this work, we investigate the factors shaping human perceptions of readability through the analysis of 1.2k judgments, finding that, beyond surface-level cues, information content and topic strongly shape text comprehensibility. Furthermore, we evaluate 15 popular readability metrics across 5 datasets, contrasting them with 5 more nuanced model-based metrics. Our results show that four model-based metrics consistently place among the top 4 in rank correlations with human judgments, while the best performing traditional metric achieves an average rank of 7.8. These findings highlight a mismatch between current readability metrics and human perceptions, pointing to model-based approaches as a more promising direction.
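
The abstract above compares metrics by their rank correlation with human judgments. As a rough illustration of what such a comparison involves, the sketch below computes Spearman's rank correlation in plain Python. It is not the authors' evaluation code: the abstract does not say which rank correlation coefficient was used, and the function names and example scores are assumptions made for this illustration.

```python
import math
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


# Hypothetical data: human difficulty ratings and one metric's scores
# for the same five texts (made-up numbers, for illustration only).
human_ratings = [1, 2, 3, 4, 5]
metric_scores = [5.0, 6.0, 7.0, 8.0, 7.0]
print(round(spearman(human_ratings, metric_scores), 3))
```

Ranking several metrics by this coefficient on each dataset, then averaging the ranks, would yield figures of the kind the abstract reports.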

Evaluating Health Question Answering Under Readability-Controlled Style Perturbations
Md Mushfiqur Rahman | Kevin Lybarger

Patients often ask semantically similar medical questions in linguistically diverse ways that vary in readability, tone, and background knowledge. A robust question answering (QA) system should both provide semantically consistent answers across stylistic differences and adapt its response style to match the user's input; however, existing QA evaluations rarely test this capability, creating critical gaps in QA evaluation that undermine accessibility and health literacy. We introduce SPQA, an evaluation framework and benchmark that applies controlled stylistic perturbations to consumer health questions while preserving semantic intent, then measures how model answers change across correctness, completeness, coherence, fluency, and linguistic adaptability using a human-validated LLM-based judge. The style axes include reading level, formality, and patient background knowledge; all perturbations are grounded in human annotations to ensure fidelity and alignment with human judgments. Our contributions include a readability-aware evaluation methodology, a style-diverse benchmark with human-grounded perturbations, and an automated evaluation pipeline validated against expert judgments. Evaluation results across multiple health QA models indicate that stylistic perturbations lead to measurable performance degradation even when semantic intent is preserved during perturbation. The largest performance drops occur in answer correctness and completeness, while models also show limited ability to adapt their style to match the input. These findings underscore the risk of inequitable information delivery and highlight the need for accessibility-aware QA evaluation.

A Multi-Agent Framework with Diagnostic Feedback for Iterative Plain Language Summary Generation from Cochrane Medical Abstracts
Felipe Arias Russi | Carolina Salazar Lara | Ruben Manrique

Plain Language Summaries (PLS) improve health literacy and enable informed healthcare decisions, but writing them requires domain expertise and is time-consuming. Automated methods often prioritize efficiency over comprehension, and medical documents' unique simplification requirements challenge generic solutions. We present a multi-agent system for generating PLS, using Cochrane PLS as proof of concept. The system uses specialized agents for information extraction, writing, diagnosis, and evaluation, integrating a medical glossary and statistical analyzer to guide revisions. We evaluated three architectural configurations on 100 Cochrane abstracts using six LLMs, both proprietary and open-source. Results reveal model-dependent trade-offs between factuality and readability, with the multi-agent approach showing improvements for smaller models and providing operational advantages in control and interpretability.

Efficient On-Device Text Simplification for Firefox with Synthetic Data Fine-Tuning
Pablo Romero | Zihao Li | Matthew Shardlow

This work presents a system for on-device text simplification that enables users to process sensitive text without relying on cloud-based services. Through the use of quantization techniques and a novel approach to controllable text simplification, we reduce model size by up to 75 percent with minimal performance degradation. Our models demonstrate efficient state-of-the-art results using a synthetic dataset of 2909 examples, outperforming prior work trained on 300K examples. This efficiency stems from (1) a single control token strategy that precisely targets specific reading levels, (2) a contrastive training approach that enriches model understanding through exposure to multiple simplification levels, and (3) individual models that dedicate full parameter capacity to specific reading level transformations. Our best models achieve up to 82.18 BLEU at the Advanced level and 46.12 SARI at the Elementary level on standard benchmarks, with performance preserved even after aggressive quantization. This work is implemented as a collaboration with the Mozilla AI team to process text entirely locally, ensuring sensitive information never leaves the user's device. We have a demonstration video (https://youtu.be/TzmaxnARMzg) and a web demo available at https://pablorom2004.github.io/Simplification-Web-Demo

This paper presents the findings of the first Shared Task on Readability-Controlled Text Simplification at TSAR 2025. The task required systems to simplify English texts to specific target readability levels of the Common European Framework of Reference for Languages (CEFR). We received 48 submissions from 20 participating teams, with approaches predominantly based on large language models (LLMs), which included iterative refinement, multi-agent setups, and LLM-as-a-judge pipelines. For this shared task, we developed a new dataset of pedagogical texts and evaluated submissions using a weighted combination of semantic similarity and CEFR-level accuracy. The results of the participating teams demonstrate that while LLMs can perform substantially well on this task, dependable and controlled simplification often requires complex, multi-iterative processes. Our findings also suggest that the capabilities of current systems are beginning to saturate existing automatic evaluation metrics, underscoring the need for reevaluation and practicality.

OneNRC@TSAR2025 Shared Task: Small Models for Readability Controlled Text Simplification
Sowmya Vajjala

In this system description paper, we describe the team OneNRC’s experiments on readability controlled text simplification, focused on using smaller, quantized language models (<20B). We compare these with one large proprietary model and show that the smaller models offer comparable or even better results in some experimental settings. The approach primarily consists of prompt optimization, agentic workflow, and tool calling. The best results were achieved while using a CEFR proficiency classifier as a verification tool for the language model agent. In terms of comparison with other systems, our submission that used a quantized Gemma3:12B model that ran on a laptop achieved a rank of 9.88 among the submitted systems as per the AUTORANK framework used by the organizers. We hope these results will lead to further exploration of the usefulness of smaller models for text simplification.

GRIPF at TSAR 2025 Shared Task: Towards controlled CEFR level simplification with the help of inter-model interactions
David Alfter | Sebastian Gombert

In this contribution to the CEFR level simplification TSAR 2025 Shared Task, we propose two systems, EZ-SCALAR and SAGA, that implement two differing approaches to prompting LLMs for proficiency-adapted simplification. Our results place us in the middle of the participating teams, and reveal that using external lexical resources to guide simplification improves overall results.

ITU NLP at TSAR 2025 Shared Task: A Three-Stage Prompting Approach for CEFR-Oriented Text Simplification
Kutay Arda Dinç | Fatih Bektaş | Gülşen Eryiğit

Automatic Text Simplification (TS) makes complex texts more accessible but often lacks control over target readability levels. We propose a lightweight, prompt-based approach to English TS that explicitly aligns outputs with CEFR proficiency standards. Our method employs a three-stage pipeline, guided by rule-informed prompts inspired by expert strategies. In the TSAR 2025 Shared Task, our system achieved competitive performance, with stronger results at B1 level and challenges at A2 level due to over-simplification. These findings highlight the promise of prompt-based CEFR-oriented simplification and the need for more flexible constraint design.

STARLING at TSAR 2025 Shared Task: Leveraging Alternative Generations for Readability Level Adjustment in Text Simplification
Piotr Przybyła

Readability adjustment is crucial in text simplification, as it makes it possible to generate language appropriate to the needs of a particular group of readers. Here we present a method for simplifying a text fragment that aims for a given CEFR level, e.g. A2 or B1. The proposed approach combines a prompted large language model with sentence-level adjustment of difficulty level. The work is evaluated within the framework of the TSAR 2025 shared task, showing a trade-off between precise readability adjustment and faithful meaning preservation.

taskGen at TSAR 2025 Shared Task: Exploring prompt strategies with linguistic knowledge
Juan Cruz Oviedo | Elisabet Comelles Pujadas | Laura Alonso Alemany | Jordi Atserias Batalla

TaskGen ranked as 6th best team in the TSAR 2025 shared task for English text adaptation to a target CEFR level. Our experiments consisted of prompting a Llama-3.1-8B-Instruct model with linguistic descriptors of the target level, examples of adaptations and multi-step approaches. Our best run, 13th in the overall ranking, applied an ensemble strategy using a voting mechanism to find the most adequate among 10 texts, each produced by a different prompting strategy.

EasyJon at TSAR 2025 Shared Task: Evaluation of Automated Text Simplification with LLM-as-a-Judge
Paul-Gerhard Barbu | Adrianna Lipska-Dieck | Lena Lindner

This paper presents an approach to automated text simplification for CEFR A2 and B1 levels using large language models and prompt engineering. We evaluate seven models across three prompting strategies: short, descriptive, and descriptive with examples. A two-round evaluation system using LLM-as-a-Judge and traditional metrics for text simplification determines optimal model-prompt combinations for final submissions. Results demonstrate that descriptive prompts consistently outperform other strategies across all models, achieving 46-65% of first-place rankings. Qwen3 shows superior performance for A2-level simplification, while B1-level results are more balanced across models. The LLM-as-a-Judge evaluation method shows strong alignment with traditional metrics while providing enhanced explainability.

HULAT-UC3M at TSAR 2025 Shared Task: A Prompt-Based Approach using Lightweight Language Models for Readability-Controlled Text Simplification
Jesus M. Sanchez-Gomez | Lourdes Moreno | Paloma Martínez | Marco Antonio Sanchez-Escudero

This paper describes the participation of the HULAT-UC3M team in the TSAR 2025 Shared Task on Readability-Controlled Text Simplification. Our approach uses open and lightweight Large Language Models (LLMs) of different sizes, together with two strategies for prompt engineering. The proposed system has been tested on the trial data provided, and evaluated using the official metrics: CEFR Compliance, Meaning Preservation, and Similarity to References. The LLaMA 3 8B model with reinforced prompts was selected as our final proposal for submission, and ranked fourteenth according to the overall metric. Finally, we discuss the main challenges that we identified in developing our approach for this task.

UoL-UPF at TSAR 2025 Shared Task: A Generate-and-Select Approach for Readability-Controlled Text Simplification
Akio Hayakawa | Nouran Khallaf | Horacio Saggion | Serge Sharoff

The TSAR 2025 Shared Task on Readability-Controlled Text Simplification focuses on simplifying English paragraphs written at an advanced level (B2 or higher) and rewriting them to target CEFR levels (A2 or B1). The challenge is to reduce linguistic complexity without sacrificing coherence or meaning. We developed three complementary approaches based on large language models (LLMs). The first approach (Run 1) generates a diverse set of paragraph-level simplifications. It then applies filters to enforce CEFR alignment, preserve meaning, and encourage diversity, and finally selects the candidates with the lowest perceived risk. The second (Run 2) performs simplification at the sentence level, combining structured prompting, coreference resolution, and explainable AI techniques to highlight influential phrases, with candidate selection guided by automatic and LLM-based judges. The third hybrid approach (Run 3) integrates both strategies by pooling paragraph- and sentence-level simplifications, and subsequently applying the identical filtering and selection architecture used in Run 1. In the official TSAR evaluation, the hybrid system ranked 2nd overall, while its component systems also achieved competitive results.

Uniandes at TSAR 2025 Shared Task: Multi-Agent CEFR Text Simplification with Automated Quality Assessment and Iterative Refinement
Felipe Arias Russi | Kevin Cohen Solano | Ruben Manrique

We present an agent-based system for the TSAR 2025 Shared Task on Readability-Controlled Text Simplification, which requires simplifying English paragraphs from B2+ levels to target A2 or B1 levels while preserving meaning. Our approach employs specialized agents for keyword extraction, text generation, and evaluation, coordinated through an iterative refinement loop. The system integrates a CEFR vocabulary classifier, pretrained evaluation models, and few-shot learning from trial data. Through iterative feedback between the evaluator and writer agents, our system automatically refines outputs until they meet both readability and semantic preservation constraints. This architecture achieved 4th position among participating teams, showing the effectiveness of combining specialized LLMs with automated quality control strategies for text simplification.

EhiMeNLP at TSAR 2025 Shared Task: Candidate Generation via Iterative Simplification and Reranking by Readability and Semantic Similarity
Rina Miyata | Koki Horiguchi | Risa Kondo | Yuki Fujiwara | Tomoyuki Kajiwara

We introduce the EhiMeNLP submission, which won the TSAR 2025 Shared Task on Readability-Controlled Text Simplification. Our system employed a two-step strategy of candidate generation and reranking. For candidate generation, we simplified the given text into more readable versions by combining multiple large language models with prompts. Then, for reranking, we selected the best candidate by readability-based filtering and ranking based on semantic similarity to the original text.
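
The reranking step described above (readability-based filtering, then ranking by semantic similarity to the original) can be sketched as follows. This is a minimal illustration, not the EhiMeNLP system: `token_jaccard` is a toy stand-in for a real semantic similarity model, `toy_level` is a made-up placeholder for a CEFR classifier, and the fallback to all candidates when none reaches the target level is an assumption the abstract does not state.

```python
from typing import Callable, Sequence


def token_jaccard(a: str, b: str) -> float:
    """Toy similarity (assumption): Jaccard overlap of lowercase tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def rerank(
    source: str,
    candidates: Sequence[str],
    target_level: str,
    predict_level: Callable[[str], str],
    similarity: Callable[[str, str], float] = token_jaccard,
) -> str:
    """Keep candidates at the target level, then pick the one closest to the source."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    on_target = [c for c in candidates if predict_level(c) == target_level]
    # Assumed fallback: if no candidate hits the target level, rank them all.
    pool = on_target or list(candidates)
    return max(pool, key=lambda c: similarity(source, c))


def toy_level(text: str) -> str:
    """Hypothetical level predictor: short words only counts as A2, otherwise B1."""
    return "A2" if max(len(w) for w in text.split()) <= 6 else "B1"


source = "The committee postponed the decision indefinitely"
candidates = [
    "The group put off the choice",
    "The committee delayed the decision",
    "The group put off the choice for now",
]
print(rerank(source, candidates, "A2", toy_level))
```

Filtering before ranking means a candidate that copies the source nearly verbatim cannot win on similarity alone unless it also reaches the target level.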

OUNLP at TSAR 2025 Shared Task: Multi-Round Text Simplifier via Code Generation
Cuong Huynh | Jie Cao

This paper describes the system submission of our team OUNLP to the TSAR-2025 shared task on readability-controlled text simplification. Based on the analysis of prompt-based text simplification methods, we discovered that simplification performance is highly related to the gap between the source CEFR level and the target CEFR level. Inspired by this finding, we propose two multi-round simplification methods generated via GPT-4o: rule-based simplification (MRS-Rule) and jointly rule-based LLM simplification (MRS-Joint). Our submitted systems ranked 7th out of 20 teams. Later improvements with MRS-Joint show that taking the LLM simplified candidates as the starting point could further boost multi-round simplification performance.

HIT-YOU at TSAR 2025 Shared Task: Leveraging Similarity-Based Few-Shot Prompting, Round-Trip Translation, and Self-Refinement for Readability-Controlled Text Simplification
Mao Shimada | Kexin Bian | Zhidong Ling | Mamoru Komachi

We describe our submission to the TSAR 2025 shared task on readability-controlled text simplification, which evaluates systems on their ability to adjust linguistic complexity to specified CEFR levels while preserving meaning and coherence. We explored two complementary frameworks leveraging the shared task CEFR classifier as feedback. The first is an ensemble approach generating diverse candidates using multiple LLMs under zero-shot prompting with level-specific instructions and vocabulary lists, one-shot prompting, and round-trip translation. Candidates were filtered by predicted CEFR level before an LLM judge selected the final output. The second framework is a self-refinement loop, where a single candidate is iteratively revised with classifier feedback until matching the target level or reaching a maximum number of iterations. This study is among the first to apply round-trip translation and iterative self-refinement to controlled simplification, broadening the toolkit for adapting linguistic complexity.
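
The self-refinement loop in the second framework (revise a single candidate with classifier feedback until it matches the target level or an iteration cap is reached) has roughly the control flow below. This is a sketch under stated assumptions, not the HIT-YOU implementation: `classify` and `revise` are hypothetical callables standing in for the shared task CEFR classifier and an LLM call, and the toy versions used here only count and drop words.

```python
from typing import Callable


def self_refine(
    text: str,
    target_level: str,
    classify: Callable[[str], str],
    revise: Callable[[str, str, str], str],
    max_iters: int = 5,
) -> tuple[str, int]:
    """Revise `text` until `classify` returns `target_level` or `max_iters` is hit.

    Returns the final candidate and the number of revisions performed.
    """
    candidate = text
    for step in range(max_iters):
        predicted = classify(candidate)
        if predicted == target_level:
            return candidate, step
        # Pass the classifier's verdict back to the reviser as feedback.
        candidate = revise(candidate, predicted, target_level)
    # Iteration cap reached: return the last revision without re-checking it.
    return candidate, max_iters


def toy_classify(text: str) -> str:
    """Hypothetical classifier: five words or fewer counts as A2."""
    return "A2" if len(text.split()) <= 5 else "B1"


def toy_revise(text: str, predicted: str, target: str) -> str:
    """Hypothetical reviser: drop the last word (a real system would call an LLM)."""
    return " ".join(text.split()[:-1])


result, revisions = self_refine(
    "one two three four five six seven eight", "A2", toy_classify, toy_revise
)
print(result, revisions)
```

The iteration cap keeps the cost bounded when the reviser never converges on the target level.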

SQUREL at TSAR 2025 Shared Task: CEFR-Controlled Text Simplification with Prompting and Reinforcement Fine-Tuning
Daria Sokova | Anastasiia Bezobrazova | Constantin Orasan

This paper summarises the submissions of our team to the TSAR 2025 Shared Task on Readability-Controlled Text Simplification, which aims to create text simplifications balancing reduced linguistic complexity, meaning preservation, and fluency while meeting predefined target readability levels. We tested two different methods for CEFR-controlled simplification: a conservative lexical pipeline relying on prompting LLMs to simplify sentences, and a setup employing reinforcement fine-tuning.

Archaeology at TSAR 2025 Shared Task: Teaching Small Models to do CEFR Simplifications
Rares-Alexandru Roscan | Sergiu Nisioi

Large language models (LLMs) have demonstrated strong performance in text simplification tasks, but their high computational cost and proprietary nature often limit practical use, especially in education. We explore open-source LLMs for CEFR-level text simplification. By reducing model size and computational requirements, our approach enables greater accessibility and deployment in educational environments. Our results show some of the lowest error rates in producing CEFR-compliant texts at TSAR 2025, using models with 8 billion and 1 billion parameters. Such approaches have the potential to democratize NLP technologies for real-world applications.

HOPE at TSAR 2025 Shared Task: Balancing Control and Complexity in Readability-Controlled Text Simplification
Sujal Maharjan | Astha Shrestha

This paper describes our submissions to the TSAR 2025 Shared Task on Readability-Controlled Text Simplification. We present a comparative study of three architectures: a rule-based baseline, a heuristic-driven expert system, and a zero-shot generative T5 pipeline with a semantic guardrail. Our analysis shows a trade-off between the controllability of rule-based systems and the fluency of generative models. In this zero-shot setting, simpler, confined systems achieved superior meaning preservation scores compared to the more powerful but less predictable generative model. We present a diagnostic failure analysis on system outputs, illustrating how different architectures result in distinct error patterns such as under-simplification, information loss via heuristics, and semantic drift.

Know-AI at TSAR 2025 Shared Task: Difficulty-aware Text Simplification System
Yiheng Wu | Anisia Katinskaia | Jue Hou | Roman Yangarber

Text simplification is an active research topic with applications in multiple domains. In a simplification pipeline, assessment of text difficulty plays a crucial role as a quality control mechanism: it acts as a critic and guides models to generate text at the difficulty level required by the user. This paper presents our Difficulty-aware Text Simplification System. We evaluate our pipeline using the TSAR shared task dataset and discuss challenges in constructing corpora for training models to assess text difficulty.