Martin Kappus


2026

Automatic text simplification (ATS) seeks to automate intralingual rewording to enhance readability and comprehension. Current evaluation practices for ATS systems predominantly rely on automatic metrics or assessments by experts and crowdworkers, often excluding the intended end users and other stakeholders, and thus limiting insights into the actual effectiveness of ATS models. In this study, we address this gap by conducting a multi-faceted, mixed-method evaluation of two LLM-based ATS systems for German (capito.ai and GPT-4o) and by involving end users, post-editors, and Easy Language experts. The findings highlight the effectiveness of the examined LLM-based ATS systems across several dimensions, including post-editing efficiency, expert quality assessments, and, in the case of GPT-4o-generated simplifications, user comprehension. Post-editing effort metrics, in particular, show a productivity increase of around 30% compared to full manual simplification. Moreover, the results reveal substantial differences in perception and understanding among participant groups. These outcomes clearly indicate that ATS for German has recently made considerable progress and, crucially, underscore the importance of incorporating multiple stakeholders into ATS evaluation to better align system performance with accessibility goals.

2024

Text simplification refers to the process of rewording within a single language, moving a text from a standard form to an easy-to-understand one. Easy Language and Plain Language are two examples of simplified varieties aimed at improving readability and comprehension for a broad audience. Human evaluation of automatic text simplification typically relies on experts or crowdworkers rating the generated texts. However, this approach does not include the target readers of simplified texts and does not reflect actual comprehensibility. In this paper, we explore different ways of measuring the quality of automatically simplified texts. We conducted a multi-faceted evaluation study involving end users, post-editors, and Easy Language experts, applying a variety of qualitative and quantitative methods. We found differences in how different user groups perceived and actually comprehended the texts. In addition, qualitative surveys and behavioral observations proved essential in interpreting the results.