2025
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)
Ekaterina Kochmar | Bashar Alhafni | Marie Bexte | Jill Burstein | Andrea Horbach | Ronja Laarmann-Quante | Anaïs Tack | Victoria Yaneva | Zheng Yuan
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)
Explaining Holistic Essay Scores in Comparative Judgment Assessments by Predicting Scores on Rubrics
Michiel De Vrindt | Renske Bouwer | Wim Van Den Noortgate | Marije Lesterhuis | Anaïs Tack
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)
Comparative judgment (CJ) is an assessment method in which multiple assessors determine the holistic quality of essays through pairwise comparisons. While CJ is recognized for generating reliable and valid scores, it falls short in providing transparency about the specific quality aspects these holistic scores represent. Our study addresses this limitation by predicting scores on a set of rubrics that measure text quality, thereby explaining the holistic scores derived from CJ. We developed feature-based machine learning models that leveraged complexity and genre features extracted from a collection of Dutch essays. We evaluated the predictability of rubric scores for text quality based on linguistic features. Subsequently, we evaluated the validity of the predicted rubric scores by examining their ability to explain the holistic scores derived from CJ. Our findings indicate that feature-based prediction models can predict relevant rubric scores moderately well. Furthermore, the predictions can be used to explain holistic scores from CJ, despite certain biases. This automated approach to explaining holistic quality scores from CJ can enhance the transparency of CJ assessments and simplify the evaluation of their validity.
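A minimal sketch of the two evaluation steps described in the abstract: predict rubric scores from linguistic features, then check how well the predicted rubric scores explain the holistic CJ scores. This is not the authors' code; the feature matrix, rubric dimensions, and holistic scores below are synthetic placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))        # complexity/genre features per essay (assumed)
rubric = rng.normal(size=(200, 4))    # gold rubric scores, e.g. content, structure, style, grammar
cj_score = rubric @ np.array([0.4, 0.3, 0.2, 0.1]) + rng.normal(scale=0.5, size=200)  # holistic CJ score

# Step 1: predict each rubric score from the linguistic features (out-of-fold predictions).
pred_rubric = np.column_stack([
    cross_val_predict(RandomForestRegressor(random_state=0), X, rubric[:, k], cv=5)
    for k in range(rubric.shape[1])
])

# Step 2: regress the holistic CJ scores on the predicted rubric scores to see
# how much of the holistic quality they explain.
explainer = LinearRegression().fit(pred_rubric, cj_score)
print("R^2 of CJ scores explained by predicted rubrics:", explainer.score(pred_rubric, cj_score))
```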
Can GPTZero’s AI Vocabulary Distinguish Between LLM-Generated and Student-Written Essays?
Veronica Schmalz | Anaïs Tack
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)
Despite recent advances in AI detection methods, their practical application, especially in education, remains limited. Educators need functional tools pointing to AI indicators within texts, rather than merely estimating whether AI was used. GPTZero’s new AI Vocabulary feature, which highlights parts of a text likely to be AI-generated based on frequent words and phrases from LLM-generated texts, offers a potential solution. However, its effectiveness has not yet been empirically validated. In this study, we examine whether GPTZero’s AI Vocabulary can effectively distinguish between LLM-generated and student-written essays. We analyze the AI Vocabulary lists published from October 2024 to March 2025 and evaluate them on a subset of the Ghostbuster dataset, which includes student and LLM essays. We train multiple Bag-of-Words classifiers using GPTZero’s AI Vocabulary terms as features and examine their individual contributions to classification. Our findings show that simply checking for the presence, not the frequency, of specific AI terms yields the best results, particularly with ChatGPT-generated essays. However, performance drops to near-random when applied to Claude-generated essays, indicating that GPTZero’s AI Vocabulary may not generalize well to texts generated by LLMs other than ChatGPT. Additionally, all classifiers based on GPTZero’s AI Vocabulary significantly underperform compared to Bag-of-Words classifiers trained directly on the full dataset vocabulary. These findings suggest that fixed vocabularies based solely on lexical features, despite their interpretability, have limited effectiveness across different LLMs and educational writing contexts.
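A minimal sketch of the classifier setup described in the abstract: a Bag-of-Words model restricted to a fixed "AI vocabulary", comparing presence-based (binary) features with frequency-based ones. The word list and essays below are hypothetical stand-ins, not GPTZero's published AI Vocabulary or the Ghostbuster data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

ai_vocabulary = ["delve", "tapestry", "multifaceted", "pivotal"]  # hypothetical subset
essays = ["The results delve into a multifaceted issue.", "I think my weekend was fun."]
labels = [1, 0]  # 1 = LLM-generated, 0 = student-written (toy example)

presence_clf = make_pipeline(
    CountVectorizer(vocabulary=ai_vocabulary, binary=True),   # presence of AI terms
    LogisticRegression(),
)
frequency_clf = make_pipeline(
    CountVectorizer(vocabulary=ai_vocabulary, binary=False),  # raw term frequencies
    LogisticRegression(),
)
for clf in (presence_clf, frequency_clf):
    clf.fit(essays, labels)
    print(clf.predict(["This pivotal essay weaves a rich tapestry of ideas."]))
```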
Findings of the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors
Ekaterina Kochmar | Kaushal Maurya | Kseniia Petukhova | KV Aditya Srivatsa | Anaïs Tack | Justin Vasselli
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)
This shared task aimed to assess the pedagogical abilities of AI tutors powered by large language models (LLMs), focusing on evaluating the quality of tutor responses aimed at remediating students’ mistakes within educational dialogues. The task consisted of five tracks: four designed to automatically evaluate the AI tutor’s performance along key dimensions of mistake identification, precise location of the mistake, providing guidance, and feedback actionability, all grounded in learning science principles that define good and effective tutor responses, and a fifth track focusing on detection of the tutor’s identity. The task attracted over 50 international teams across all tracks. The submitted models were evaluated against gold-standard human annotations, and the results, while promising, show that there is still significant room for improvement in this domain: the best results for the four pedagogical ability assessment tracks range between macro F1 scores of 58.34 (for providing guidance) and 71.81 (for mistake identification) on three-class problems, with the best F1 score in the tutor identification track reaching 96.98 on a 9-class task. In this paper, we overview the main findings of the shared task, discuss the approaches taken by the teams, and analyze their performance. All resources associated with this task are made publicly available to support future research in this critical domain (https://github.com/kaushal0494/UnifyingAITutorEvaluation/tree/main/BEA_Shared_Task_2025_Datasets).
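A minimal sketch of the evaluation metric reported above: macro-averaged F1 over a three-class pedagogical dimension. The labels and predictions are toy data, not actual shared-task submissions.

```python
from sklearn.metrics import f1_score

gold = ["Yes", "No", "To some extent", "Yes", "No", "Yes"]
pred = ["Yes", "No", "No", "Yes", "To some extent", "Yes"]
print("macro F1:", f1_score(gold, pred, average="macro"))
```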
2024
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)
Ekaterina Kochmar | Marie Bexte | Jill Burstein | Andrea Horbach | Ronja Laarmann-Quante | Anaïs Tack | Victoria Yaneva | Zheng Yuan
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)
Predicting Initial Essay Quality Scores to Increase the Efficiency of Comparative Judgment Assessments
Michiel De Vrindt | Anaïs Tack | Renske Bouwer | Wim Van Den Noortgate | Marije Lesterhuis
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)
Comparative judgment (CJ) is a method that can be used to assess the writing quality of student essays based on repeated pairwise comparisons by multiple assessors. Although the assessment method is known to have high validity and reliability, it can be particularly inefficient, as assessors must make many judgments before the scores become reliable. Prior research has investigated methods to improve the efficiency of CJ, yet these methods introduce additional challenges, notably stemming from the initial lack of information at the start of the assessment, which is known as a cold-start problem. This paper reports on a study in which we predict the initial quality scores of essays to establish a warm start for CJ. To achieve this, we construct informative prior distributions for the quality scores based on the predicted initial quality scores. Through simulation studies, we demonstrate that our approach increases the efficiency of CJ: On average, assessors need to make 30% fewer judgments for each essay to reach an overall reliability level of 0.70.
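A minimal sketch, under assumptions, of the warm-start idea described above: predicted initial essay quality scores are turned into informative normal priors for a Bradley-Terry-style comparative judgment model. The prior standard deviation and the simple MAP update are illustrative choices, not the authors' exact procedure.

```python
import numpy as np

def map_bradley_terry(prior_mean, comparisons, prior_sd=1.0, lr=0.05, steps=500):
    """MAP estimate of essay quality; comparisons are (winner, loser) index pairs."""
    theta = prior_mean.copy()  # warm start at the predicted initial scores
    for _ in range(steps):
        grad = -(theta - prior_mean) / prior_sd**2  # gradient of the normal prior
        for w, l in comparisons:
            p = 1.0 / (1.0 + np.exp(-(theta[w] - theta[l])))  # P(winner beats loser)
            grad[w] += 1.0 - p
            grad[l] -= 1.0 - p
        theta += lr * grad
    return theta

prior_mean = np.array([0.3, -0.1, 0.8])   # predicted initial quality scores (assumed)
comparisons = [(2, 0), (2, 1), (0, 1)]    # pairwise judgments: winner beats loser
print(map_bradley_terry(prior_mean, comparisons))
```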
ITEC at BEA 2024 Shared Task: Predicting Difficulty and Response Time of Medical Exam Questions with Statistical, Machine Learning, and Language Models
Anaïs Tack | Siem Buseyne | Changsheng Chen | Robbe D’hondt | Michiel De Vrindt | Alireza Gharahighehi | Sameh Metwaly | Felipe Kenji Nakano | Ann-Sophie Noreillie
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)
This paper presents the results of our participation in the BEA 2024 shared task on the automated prediction of item difficulty and item response time (APIDIRT), hosted by the NBME (National Board of Medical Examiners). During this task, practice multiple-choice questions from the United States Medical Licensing Examination® (USMLE®) were shared, and research teams were tasked with devising systems capable of predicting the difficulty and average response time for new exam questions. Our team, part of the interdisciplinary itec research group, participated in the task. We extracted linguistic features and clinical embeddings from question items and tested various modeling techniques, including statistical regression, machine learning, language models, and ensemble methods. Surprisingly, simpler models such as Lasso and random forest regression, utilizing principal component features from linguistic and clinical embeddings, outperformed more complex models. In the competition, our random forest model ranked 4th out of 43 submissions for difficulty prediction, while the Lasso model secured the 2nd position out of 34 submissions for response time prediction. Further analysis suggests that had we submitted the Lasso model for difficulty prediction, we would have achieved an even higher ranking. We also observed that predicting response time is easier than predicting difficulty, with features such as item length, type, exam step, and analytical thinking influencing response time prediction more significantly.
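A minimal sketch of the modelling recipe described above: principal components of linguistic features and clinical embeddings fed into Lasso and random forest regressors. Data shapes, hyperparameters, and the synthetic target are placeholders, not the USMLE data or the team's tuned models.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 768 + 40))   # clinical embeddings + linguistic features (assumed)
difficulty = rng.normal(size=300)      # item difficulty target (toy)

for name, reg in [("lasso", Lasso(alpha=0.1)), ("random forest", RandomForestRegressor(random_state=0))]:
    model = make_pipeline(StandardScaler(), PCA(n_components=20), reg)
    rmse = -cross_val_score(model, X, difficulty, cv=5, scoring="neg_root_mean_squared_error").mean()
    print(name, "RMSE:", round(rmse, 3))
```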
ITEC at MLSP 2024: Transferring Predictions of Lexical Difficulty from Non-Native Readers
Anaïs Tack
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)
This paper presents the results of our team’s participation in the BEA 2024 shared task on the multilingual lexical simplification pipeline (MLSP; Shardlow et al., 2024). During the task, organizers supplied data that combined two components of the simplification pipeline: lexical complexity prediction and lexical substitution. This dataset encompassed ten languages, including French. Given the absence of dedicated training data, teams were challenged with employing systems trained on pre-existing resources and evaluating their performance on unexplored test data. Our team contributed to the task using previously developed models for predicting lexical difficulty in French (Tack, 2021). These models were built on deep learning architectures, building on our participation in the CWI 2018 shared task (De Hertog and Tack, 2018). The training dataset comprised 262,054 binary decision annotations, capturing perceived lexical difficulty, collected from a sample of 56 non-native French readers. Two pre-trained neural logistic models were used: (1) a model for predicting difficulty for words within their sentence context, and (2) a model for predicting difficulty for isolated words. The findings revealed that despite being trained for a distinct prediction task (as indicated by a negative R2 fit), transferring the logistic predictions of lexical difficulty to continuous scores of lexical complexity exhibited a positive correlation. Specifically, the results indicated that isolated predictions exhibited a higher correlation (r = .36) compared to contextualized predictions (r = .33). Moreover, isolated predictions demonstrated a remarkably higher Spearman rank correlation (ρ = .50) than contextualized predictions (ρ = .35). These results align with earlier observations by Tack (2021), suggesting that the ground truth primarily captures lexical access difficulties rather than word-to-context integration problems.
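A minimal sketch of the transfer evaluation described above: binary-trained difficulty probabilities are correlated with continuous lexical complexity scores using Pearson's r and Spearman's ρ. The arrays are toy stand-ins for the model outputs and the MLSP gold complexity scores.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

predicted_difficulty = np.array([0.05, 0.40, 0.75, 0.20, 0.90])  # logistic model outputs (assumed)
gold_complexity = np.array([0.10, 0.35, 0.60, 0.25, 0.80])       # continuous complexity scores

r, _ = pearsonr(predicted_difficulty, gold_complexity)
rho, _ = spearmanr(predicted_difficulty, gold_complexity)
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")
```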
2023
Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)
Ekaterina Kochmar | Jill Burstein | Andrea Horbach | Ronja Laarmann-Quante | Nitin Madnani | Anaïs Tack | Victoria Yaneva | Zheng Yuan | Torsten Zesch
Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)
The BEA 2023 Shared Task on Generating AI Teacher Responses in Educational Dialogues
Anaïs Tack | Ekaterina Kochmar | Zheng Yuan | Serge Bibauw | Chris Piech
Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)
This paper describes the results of the first shared task on generation of teacher responses in educational dialogues. The goal of the task was to benchmark the ability of generative language models to act as AI teachers, replying to a student in a teacher-student dialogue. Eight teams participated in the competition hosted on CodaLab and experimented with a wide variety of state-of-the-art models, including Alpaca, Bloom, DialoGPT, DistilGPT-2, Flan-T5, GPT-2, GPT-3, GPT-4, LLaMA, OPT-2.7B, and T5-base. Their submissions were automatically scored using BERTScore and DialogRPT metrics, and the top three among them were further manually evaluated in terms of pedagogical ability based on Tack and Piech (2022). The NAISTeacher system, which ranked first in both automated and human evaluation, generated responses with GPT-3.5 Turbo using an ensemble of prompts and DialogRPT-based ranking of responses for given dialogue contexts. Despite promising achievements of the participating teams, the results also highlight the need for evaluation metrics better suited to educational contexts.
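A minimal sketch (not the NAISTeacher system) of DialogRPT-based ranking of candidate teacher responses for a given dialogue context, as mentioned above. It requires downloading the public microsoft/DialogRPT-updown model via the transformers library; the dialogue context and candidate responses are invented.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_card = "microsoft/DialogRPT-updown"
tokenizer = AutoTokenizer.from_pretrained(model_card)
model = AutoModelForSequenceClassification.from_pretrained(model_card)

def dialogrpt_score(context: str, response: str) -> float:
    """Higher score = response judged more likely to be well received."""
    ids = tokenizer.encode(context + "<|endoftext|>" + response, return_tensors="pt")
    with torch.no_grad():
        logits = model(ids).logits
    return torch.sigmoid(logits).item()

context = "Student: I don't understand why my loop never stops."
candidates = [
    "Just copy my solution.",
    "Look closely at the loop condition: when does it become false?",
]
print(max(candidates, key=lambda r: dialogrpt_score(context, r)))
```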
2022
Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022)
Ekaterina Kochmar | Jill Burstein | Andrea Horbach | Ronja Laarmann-Quante | Nitin Madnani | Anaïs Tack | Victoria Yaneva | Zheng Yuan | Torsten Zesch
Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022)
FABRA: French Aggregator-Based Readability Assessment toolkit
Rodrigo Wilkens | David Alfter | Xiaoou Wang | Alice Pintard | Anaïs Tack | Kevin P. Yancey | Thomas François
Proceedings of the Thirteenth Language Resources and Evaluation Conference
In this paper, we present FABRA, a readability toolkit based on the aggregation of a large number of readability predictor variables. The toolkit is implemented as a service-oriented architecture, which obviates the need for installation and simplifies its integration into other projects. We also perform a set of experiments to show which features are most predictive on two different corpora, and how the use of aggregators improves performance over standard feature-based readability prediction. Our experiments show that, for the explored corpora, the most important predictors for native texts are measures of lexical diversity, dependency counts, and text coherence, while the most important predictors for foreign texts are syntactic variables illustrating language development, as well as features linked to lexical sophistication. FABRA has the potential to support new research on readability assessment for French.
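A minimal sketch of the aggregation idea described above: word-level predictor values (here, word length) are turned into text-level readability features using several aggregators rather than a single mean. The predictors and aggregators shown are illustrative, not FABRA's full feature set.

```python
import numpy as np

def aggregate(word_values):
    """Aggregate word-level predictor values into text-level features."""
    v = np.asarray(word_values, dtype=float)
    return {"mean": v.mean(), "sd": v.std(), "median": np.median(v),
            "p90": np.percentile(v, 90)}

word_lengths = [len(w) for w in "le petit chat dort sur le canapé confortable".split()]
print(aggregate(word_lengths))
```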
2020
Alector: A Parallel Corpus of Simplified French Texts with Alignments of Misreadings by Poor and Dyslexic Readers
Núria Gala | Anaïs Tack | Ludivine Javourey-Drevet | Thomas François | Johannes C. Ziegler
Proceedings of the Twelfth Language Resources and Evaluation Conference
In this paper, we present a new parallel corpus addressed to researchers, teachers, and speech therapists interested in text simplification as a means of alleviating difficulties in children learning to read. The corpus is composed of excerpts drawn from 79 authentic literary (tales, stories) and scientific (documentary) texts commonly used in French schools for children aged 7 to 9. The excerpts were manually simplified at the lexical, morpho-syntactic, and discourse levels in order to propose a parallel corpus for reading tests and for the development of automatic text simplification tools. A sample of 21 poor-reading and dyslexic children with an average reading delay of 2.5 years read a portion of the corpus. The transcripts of reading errors were integrated into the corpus with the goal of identifying lexical difficulty in the target population. By means of statistical testing, we provide evidence that the manual simplifications significantly reduced reading errors, highlighting that the words targeted for simplification were not only well-chosen but also substituted with substantially easier alternatives. The entire corpus is available for consultation through a web interface and available on demand for research purposes.
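A minimal sketch of the kind of statistical comparison described above: reading error counts on original versus manually simplified versions of the same excerpts, compared with a paired test. The choice of the Wilcoxon signed-rank test and the toy counts are illustrative assumptions, not the exact analysis reported in the paper.

```python
from scipy.stats import wilcoxon

errors_original   = [14, 9, 11, 20, 7, 16, 12]  # errors per excerpt, original version (toy)
errors_simplified = [ 8, 6,  9, 13, 5, 11,  9]  # errors on the simplified version (toy)
stat, p = wilcoxon(errors_original, errors_simplified)
print(f"Wilcoxon W = {stat}, p = {p:.4f}")
```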
2018
A Report on the Complex Word Identification Shared Task 2018
Seid Muhie Yimam | Chris Biemann | Shervin Malmasi | Gustavo Paetzold | Lucia Specia | Sanja Štajner | Anaïs Tack | Marcos Zampieri
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications
We report the findings of the second Complex Word Identification (CWI) shared task organized as part of the BEA workshop co-located with NAACL-HLT’2018. The second CWI shared task featured multilingual and multi-genre datasets divided into four tracks: English monolingual, German monolingual, Spanish monolingual, and a multilingual track with a French test set, and two tasks: binary classification and probabilistic classification. A total of 12 teams submitted their results in different task/track combinations and 11 of them wrote system description papers that are referred to in this report and appear in the BEA workshop proceedings.
NT2Lex: A CEFR-Graded Lexical Resource for Dutch as a Foreign Language Linked to Open Dutch WordNet
Anaïs Tack | Thomas François | Piet Desmet | Cédrick Fairon
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications
In this paper, we introduce NT2Lex, a novel lexical resource for Dutch as a foreign language (NT2) which includes frequency distributions of 17,743 words and expressions attested in expert-written textbook texts and readers graded along the scale of the Common European Framework of Reference (CEFR). In essence, the lexicon informs us about what kind of vocabulary should be understood when reading Dutch as a non-native reader at a particular proficiency level. The main novelty of the resource with respect to the previously developed CEFR-graded lexicons concerns the introduction of corpus-based evidence for L2 word sense complexity through the linkage to Open Dutch WordNet (Postma et al., 2016). The resource thus contains, on top of the lemmatised and part-of-speech tagged lexical entries, a total of 11,999 unique word senses and 8,934 distinct synsets.
Deep Learning Architecture for Complex Word Identification
Dirk De Hertog | Anaïs Tack
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications
We describe a system for the CWI task that includes information on five aspects of the (complex) lexical item, namely distributional information of the item itself, morphological structure, psychological measures, corpus counts, and topical information. We constructed a deep learning architecture that combines those features and applied it to the probabilistic and binary classification tasks for all English sets and Spanish. We achieved reasonable performance on all sets, with the best performances seen on the probabilistic task, particularly on the English news set (MAE 0.054 and F1-score of 0.872). An analysis of the results shows that reasonable performance can be achieved with a single architecture without any domain-specific tweaking of the parameter settings, and that distributional features capture almost all of the information also found in hand-crafted features.
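A minimal sketch, not the authors' architecture: a small feed-forward network that concatenates the five feature groups mentioned above (distributional, morphological, psychological, corpus counts, topical) and outputs a complexity probability for the probabilistic CWI task. Feature dimensions and the training target are assumed placeholders.

```python
import torch
import torch.nn as nn

class CWIModel(nn.Module):
    def __init__(self, dims=(300, 10, 5, 4, 50)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(sum(dims), 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid(),  # probability that the item is complex
        )

    def forward(self, feature_groups):
        return self.net(torch.cat(feature_groups, dim=-1)).squeeze(-1)

model = CWIModel()
groups = [torch.randn(8, d) for d in (300, 10, 5, 4, 50)]  # batch of 8 lexical items
prob_complex = model(groups)
loss = nn.functional.l1_loss(prob_complex, torch.rand(8))  # MAE-style regression objective (toy targets)
print(prob_complex.shape, loss.item())
```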
2017
Human and Automated CEFR-based Grading of Short Answers
Anaïs Tack | Thomas François | Sophie Roekhaut | Cédrick Fairon
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications
This paper is concerned with the task of automatically assessing the written proficiency level of non-native (L2) learners of English. Drawing on previous research on automated L2 writing assessment following the Common European Framework of Reference for Languages (CEFR), we investigate the possibilities and difficulties of deriving the CEFR level from short answers to open-ended questions, a task which has received little attention to date. The objective of our study is twofold: to examine the intricacies involved in both human and automated CEFR-based grading of short answers. On the one hand, we describe the compilation of a learner corpus of short answers graded with CEFR levels by three certified Cambridge examiners. We mainly observe that, although the shortness of the answers is reported as undermining a clear-cut evaluation, the length of the answer does not necessarily correlate with inter-examiner disagreement. On the other hand, we explore the development of a soft-voting system for the automated CEFR-based grading of short answers and draw tentative conclusions about its use in a computer-assisted testing (CAT) setting.
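A minimal sketch of a soft-voting setup for CEFR-based grading of short answers, in the spirit of the system described above but not its actual implementation. The answers, features, and component classifiers are illustrative assumptions.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

answers = ["I goes to school yesterday.",
           "I believe this policy would greatly benefit students.",
           "He like football.",
           "The proposal addresses several shortcomings convincingly."]
cefr = ["A2", "B2", "A2", "B2"]  # toy CEFR labels

model = make_pipeline(
    TfidfVectorizer(),
    VotingClassifier(
        estimators=[("lr", LogisticRegression(max_iter=1000)), ("nb", MultinomialNB())],
        voting="soft",  # average predicted class probabilities across classifiers
    ),
)
model.fit(answers, cefr)
print(model.predict(["She are very happy with her new job."]))
```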
2016
Modèles adaptatifs pour prédire automatiquement la compétence lexicale d’un apprenant de français langue étrangère (Adaptive models for automatically predicting the lexical competence of French as a foreign language learners)
Anaïs Tack | Thomas François | Anne-Laure Ligozat | Cédrick Fairon
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Articles longs)
This study examines the use of supervised incremental learning methods to predict the lexical competence of learners of French as a foreign language (FLE). The target learners are Dutch speakers with an A2/B1 level according to the Common European Framework of Reference for Languages (CEFR). Following recent work on predicting lexical mastery using complexity features, we develop two types of models that adapt based on feedback revealing the learner's knowledge. In particular, we define (i) a model that predicts the lexical competence of all learners at the same proficiency level and (ii) a model that predicts the lexical competence of an individual learner. The resulting models are then evaluated against a reference model that determines lexical competence from a lexicon specialized for FLE, and are shown to gain significantly in accuracy (9%-17%).
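A minimal sketch of the incremental (online) learning idea described above: a classifier trained on complexity-style features is updated after each piece of learner feedback revealing whether a word was known. The features, feedback, and model choice are invented placeholders, not the paper's models.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss="log_loss", random_state=0)
classes = np.array([0, 1])  # 0 = word unknown to the learner, 1 = word known

# Initial fit on pooled data from learners at the same proficiency level (model i).
X_pool = np.random.default_rng(0).normal(size=(100, 6))  # lexical complexity features (assumed)
y_pool = (X_pool[:, 0] > 0).astype(int)
model.partial_fit(X_pool, y_pool, classes=classes)

# Adaptation to an individual learner (model ii): update after each feedback item.
for x_word, known in [(np.zeros(6), 1), (np.ones(6), 0)]:
    model.partial_fit(x_word.reshape(1, -1), [known])
print(model.predict(np.zeros((1, 6))))
```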
SVALex: a CEFR-graded Lexical Resource for Swedish Foreign and Second Language Learners
Thomas François | Elena Volodina | Ildikó Pilán | Anaïs Tack
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
The paper introduces SVALex, a lexical resource primarily aimed at learners and teachers of Swedish as a foreign and second language that describes the distribution of 15,681 words and expressions across the levels of the Common European Framework of Reference (CEFR). The resource is based on a corpus of coursebook texts, and thus describes the receptive vocabulary learners are exposed to during reading activities, as opposed to the productive vocabulary they use when speaking or writing. The paper describes the methodology applied to create the list and to estimate the frequency distribution. It also discusses some characteristics of the resulting resource and compares it to other lexical resources for Swedish. An interesting feature of this resource is the possibility to separate the wheat from the chaff, distinguishing the core vocabulary at each level, i.e. vocabulary shared by several coursebook writers, from the peripheral vocabulary used by only a minority of coursebook writers.
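A minimal sketch of one way to separate core from peripheral vocabulary at a CEFR level, as described above: a word counts as core if it appears in at least a given share of coursebooks at that level. The threshold and the toy occurrence data are assumptions, not the SVALex methodology verbatim.

```python
# word -> set of coursebooks in which it occurs at level B1 (toy data)
occurrences = {"hus": {"book1", "book2", "book3"}, "fjäll": {"book2"}, "resa": {"book1", "book3"}}
n_books, threshold = 3, 0.5

core = {w for w, books in occurrences.items() if len(books) / n_books >= threshold}
peripheral = set(occurrences) - core
print("core:", core, "peripheral:", peripheral)
```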
Evaluating Lexical Simplification and Vocabulary Knowledge for Learners of French: Possibilities of Using the FLELex Resource
Anaïs Tack | Thomas François | Anne-Laure Ligozat | Cédrick Fairon
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
This study examines two possibilities of using the FLELex graded lexicon for the automated assessment of text complexity in the context of learning French as a foreign language. From the lexical frequency distributions described in FLELex, we derive a single level of difficulty for each word in a parallel corpus of original and simplified texts. We then use these data to automatically assess the lexical complexity of texts in two ways. On the one hand, we evaluate the degree of lexical simplification in manually simplified texts with respect to their original version. Our results show a significant simplification effect, both in the case of French narratives simplified for non-native readers and in the case of simplified Wikipedia texts. On the other hand, we define a predictive model which identifies the number of words in a text that are expected to be known at a particular learning level. We assess the accuracy with which these predictions are able to capture actual word knowledge as reported by Dutch-speaking learners of French. Our study shows that although the predictions seem relatively accurate in general (87.4% to 92.3%), they do not yet capture the learners’ lack of knowledge very well.
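A minimal sketch of the first derivation step described above: collapsing a FLELex-style frequency distribution over CEFR levels into a single difficulty level per word. Here the level is taken to be the first CEFR level at which the word is attested, which is one common simplification and an assumption, not necessarily the exact rule used in the paper; the frequencies are toy values.

```python
CEFR = ["A1", "A2", "B1", "B2", "C1", "C2"]

flelex_like = {                        # word -> frequencies per CEFR level (toy data)
    "maison": [120.0, 90.0, 60.0, 40.0, 30.0, 20.0],
    "néanmoins": [0.0, 0.0, 2.5, 8.0, 12.0, 15.0],
}

def single_level(freqs, levels=CEFR):
    """Return the first level with a non-zero frequency."""
    return next(level for level, f in zip(levels, freqs) if f > 0)

for word, freqs in flelex_like.items():
    print(word, "->", single_level(freqs))
```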