Sol Lago


2025

Quantifying word complexity for Leichte Sprache: A computational metric and its psycholinguistic validation
Umesh Patil | Jesus Calvillo | Sol Lago | Anne-Kathrin Schumann
Proceedings of the 1st Workshop on Artificial Intelligence and Easy and Plain Language in Institutional Contexts (AI & EL/PL)

Leichte Sprache (Easy Language or Easy German) is a strongly simplified version of German geared toward a target group with limited language proficiency. In Germany, public bodies are required to provide information in Leichte Sprache. Unfortunately, Leichte Sprache rules are traditionally defined by non-linguists: they are not rooted in linguistic research, and they do not provide precise decision criteria or devices for measuring the complexity of linguistic structures (Bock and Pappert, 2023). For instance, one of the rules simply recommends using simple rather than complex words. In this paper, we therefore propose a model for determining word complexity. We train an XGBoost classifier of word complexity that leverages word-level linguistic and corpus-level distributional features, frequency information from an in-house Leichte Sprache corpus, and human complexity annotations. We validate the model psycholinguistically by showing that it captures human word recognition times above and beyond traditional word-level predictors. Moreover, we discuss a number of practical applications of our classifier, such as evaluating AI-simplified text and detecting the CEFR levels of words. To our knowledge, this is one of the first attempts to systematically quantify word complexity in the context of Leichte Sprache and to link it directly to real-time word processing.
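
To make the modeling setup more concrete, the Python sketch below shows how a word-complexity classifier of this general kind could be trained with XGBoost on word-level features and human complexity labels. The feature names, toy values, labels, and hyperparameters are illustrative assumptions, not the feature set or configuration used in the paper.

# Illustrative sketch only: feature names, toy data, labels, and
# hyperparameters are assumptions, not the paper's actual setup.
import numpy as np
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split

# Hypothetical word-level feature table: one row per word.
features = pd.DataFrame({
    "log_corpus_freq": np.log1p([12000, 35, 870, 4, 5600, 150]),  # frequency in a general corpus
    "log_ls_freq":     np.log1p([900, 0, 15, 0, 410, 2]),         # frequency in a Leichte Sprache corpus
    "word_length":     [4, 14, 9, 17, 6, 12],                     # number of characters
    "n_morphemes":     [1, 4, 2, 5, 1, 3],                        # morphological complexity
    "is_compound":     [0, 1, 0, 1, 0, 1],
})
labels = np.array([0, 1, 0, 1, 0, 1])  # human annotations: 0 = simple, 1 = complex

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.33, random_state=0
)

model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)

# Graded complexity scores, e.g. for flagging words that may need simplification.
print(model.predict_proba(X_test)[:, 1])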

2022

Computational cognitive modeling of predictive sentence processing in a second language
Umesh Patil | Sol Lago
Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL)

We propose an ACT-R cue-based retrieval model of the real-time gender predictions displayed by second language (L2) learners. The model extends a previous model of native (L1) speakers according to two central accounts in L2 sentence processing: (i) the Interference Hypothesis, which proposes that retrieval interference is higher in L2 than in L1 speakers; (ii) the Lexical Bottleneck Hypothesis, which proposes that problems with gender agreement are due to weak gender representations. We tested the predictions of these accounts using data from two visual world experiments, which found that the gender predictions elicited by German possessive pronouns were delayed and smaller in size in L2 than in L1 speakers. The experiments also found a “match effect”, such that predictions arose earlier when the antecedent and the possessee of the pronoun had the same gender than when the two genders differed. This match effect was smaller in L2 than in L1 speakers. The model implementing the Lexical Bottleneck Hypothesis captured the smaller predictions, the smaller match effect, and the delayed predictions in one of the two conditions. By contrast, the model implementing the Interference Hypothesis captured the smaller prediction effect, but it showed an earlier prediction effect and a larger match effect in L2 than in L1 speakers. These results provide evidence for the Lexical Bottleneck Hypothesis, and they demonstrate a method for extending computational models of L1 processing to L2 processing.
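
As background for this modeling approach, the Python sketch below illustrates the standard ACT-R activation and latency equations that cue-based retrieval models build on (A_i = B_i + Σ_j W_j S_ji and T_i = F e^{-A_i}). The parameter values, the cue set, and the way the two hypotheses are operationalized here (a down-weighted gender cue vs. a higher fan for interference) are simplified assumptions, not the paper's implementation.

# Illustrative sketch of standard ACT-R cue-based retrieval equations.
# Parameter values and the operationalization of the two hypotheses are
# simplifying assumptions, not the paper's implementation.
import math

def activation(base_level, cue_matches, cue_weights, max_assoc=1.0, fan=1.0):
    """A_i = B_i + sum_j W_j * S_ji, with S_ji ~ max_assoc - log(fan) for matching cues."""
    spreading = sum(
        w * (max_assoc - math.log(fan)) if match else 0.0
        for match, w in zip(cue_matches, cue_weights)
    )
    return base_level + spreading

def retrieval_latency(act, latency_factor=1.0):
    """T_i = F * exp(-A_i), in seconds."""
    return latency_factor * math.exp(-act)

# Retrieval cues set by a possessive pronoun: [antecedent cue, gender cue].
antecedent = dict(base_level=0.0, cue_matches=[True, True])

# L1 baseline: full weight on both cues.
a_l1 = activation(**antecedent, cue_weights=[1.0, 1.0])

# Lexical Bottleneck Hypothesis (assumed operationalization): weak gender
# representations modeled as a reduced weight on the gender cue.
a_l2_bottleneck = activation(**antecedent, cue_weights=[1.0, 0.4])

# Interference Hypothesis (assumed operationalization): more similar competitors
# modeled as a higher fan, which reduces spreading activation.
a_l2_interference = activation(**antecedent, cue_weights=[1.0, 1.0], fan=2.0)

for label, a in [("L1", a_l1), ("L2 bottleneck", a_l2_bottleneck), ("L2 interference", a_l2_interference)]:
    print(f"{label}: activation {a:.2f}, latency {retrieval_latency(a) * 1000:.0f} ms")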