2025
Alignment Drift in CEFR-prompted LLMs for Interactive Spanish Tutoring
Mina Almasi | Ross Kristensen-McLachlan
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)
This paper investigates the potential of Large Language Models (LLMs) as adaptive tutors in the context of second-language learning. In particular, we evaluate whether system prompting can reliably constrain LLMs to generate only text appropriate to the student's competence level. We simulate full teacher-student dialogues in Spanish using instruction-tuned, open-source LLMs ranging in size from 7B to 12B parameters. Dialogues are generated by having an LLM alternate between tutor and student roles with separate chat histories. The output from the tutor model is then used to evaluate the effectiveness of CEFR-based prompting to control text difficulty across three proficiency levels (A1, B1, C1). Our findings suggest that while system prompting can be used to constrain model outputs, prompting alone is too brittle for sustained, long-term interactional contexts, a phenomenon we term alignment drift. Our results provide insights into the feasibility of LLMs as personalized, proficiency-aligned adaptive tutors and offer a scalable method for low-cost evaluation of model performance without human participants.
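A minimal sketch of the two-role simulation loop described in the abstract, assuming a generic chat-completion interface. The `generate` function and the system prompts are hypothetical stand-ins, not the paper's actual implementation:

```python
# Sketch of the dialogue simulation: one LLM alternates between tutor and
# student personas, each persona keeping its own separate chat history.
# `generate()` is a hypothetical placeholder for a chat-completion call to
# an instruction-tuned open-source model; replace it with a real client.

def generate(messages: list[dict[str, str]]) -> str:
    """Hypothetical chat-completion call; not the paper's actual code."""
    raise NotImplementedError

CEFR_LEVEL = "A1"  # one of A1, B1, C1
TUTOR_SYSTEM = (
    f"You are a Spanish tutor. Write only in Spanish appropriate to "
    f"CEFR level {CEFR_LEVEL}."
)
STUDENT_SYSTEM = "You are a student learning Spanish. Reply in Spanish."

def simulate_dialogue(opening: str, n_turns: int = 10) -> list[str]:
    """Alternate tutor and student turns with separate chat histories."""
    tutor_hist = [{"role": "system", "content": TUTOR_SYSTEM}]
    student_hist = [{"role": "system", "content": STUDENT_SYSTEM}]
    student_msg, tutor_turns = opening, []
    for _ in range(n_turns):
        # The student's last message is 'user' input in the tutor's history.
        tutor_hist.append({"role": "user", "content": student_msg})
        tutor_msg = generate(tutor_hist)
        tutor_hist.append({"role": "assistant", "content": tutor_msg})
        tutor_turns.append(tutor_msg)
        # Conversely, the tutor's reply is 'user' input for the student.
        student_hist.append({"role": "user", "content": tutor_msg})
        student_msg = generate(student_hist)
        student_hist.append({"role": "assistant", "content": student_msg})
    return tutor_turns  # only tutor output is scored for CEFR alignment
```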
I only read it for the plot! Maturity Ratings Affect Fanfiction Style and Community Engagement
Mia Jacobsen | Ross Kristensen-McLachlan
Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities
We consider the textual profiles of different fanfiction maturity ratings, how they vary across fan groups, and how this relates to reader engagement metrics. Previous studies have shown that fanfiction writing is motivated by a combination of admiration for and frustration with the fan object. These findings emerge when looking at fanfiction as a whole, as well as when it is divided into subgroups, also called fandoms. Maturity ratings indicate the intended audience of the fanfiction, as well as whether the story includes mature themes and explicit scenes. Since these ratings can be used to filter readers and writers, they can also be seen as a proxy for different reader/writer motivations and desires. We find that explicit fanfiction in particular has a distinct textual profile when compared to other maturity ratings. These findings thus nuance our understanding of reader/writer motivations in fanfiction communities, and also highlight the influence of community norms and fan behavior more generally on these cultural products.
2024
A New Benchmark for Kalaallisut-Danish Neural Machine Translation
Ross Kristensen-McLachlan | Johanne Nedergård
Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024)
Kalaallisut, also known as (West) Greenlandic, poses a number of unique challenges to contemporary natural language processing (NLP). In particular, the language has historically lacked benchmarking datasets and robust evaluation of specific NLP tasks, such as neural machine translation (NMT). In this paper, we present a new benchmark dataset for Greenlandic-to-Danish NMT comprising over 1.2M words of Greenlandic and 2.1M words of parallel Danish translations. We provide initial metrics for models trained on this dataset and conclude by suggesting how these findings can be taken forward to other NLP tasks for the Greenlandic language.
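The paper does not specify its evaluation tooling, but corpus-level scoring on such a benchmark might look like the following sketch using sacreBLEU; the file names and the choice of BLEU and chrF are assumptions:

```python
# Sketch of corpus-level evaluation for a Greenlandic-to-Danish NMT system
# using sacreBLEU. File names and metric choice (BLEU plus chrF, the latter
# often preferred for morphologically rich languages) are assumptions; the
# paper itself does not name its evaluation setup.
import sacrebleu

with open("hypotheses.da") as f:   # system translations, one per line
    hypotheses = [line.strip() for line in f]
with open("references.da") as f:   # reference Danish translations
    references = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}  chrF: {chrf.score:.2f}")
```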
2023
DanSumT5: Automatic Abstractive Summarization for Danish
Sara Kolding | Katrine Nymann | Ida Hansen | Kenneth Enevoldsen | Ross Kristensen-McLachlan
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
Automatic abstractive text summarization is a challenging task in the field of natural language processing. This paper presents a model for domain-specific summarization for Danish news articles, DanSumT5: an mT5 model fine-tuned on a cleaned subset of the DaNewsroom dataset consisting of abstractive summary-article pairs. The resulting state-of-the-art model is evaluated both quantitatively and qualitatively, using ROUGE and BERTScore metrics and human rankings of the summaries. We find that although model refinements increase quantitative and qualitative performance, the model is still prone to factual errors. We discuss the limitations of current evaluation methods for automatic abstractive summarization and underline the need for improved metrics and transparency within the field. We suggest that future work should employ methods for detecting and reducing errors in model output and methods for reference-free evaluation of summaries.
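A minimal sketch of how such a quantitative evaluation might be run with the Hugging Face `evaluate` library; the library choice and the example texts are assumptions, not the paper's actual pipeline:

```python
# Sketch of ROUGE and BERTScore evaluation for Danish summaries using the
# Hugging Face `evaluate` library (an assumption; the paper does not name
# its tooling). `summaries` are model outputs, `references` gold summaries.
import evaluate

summaries = ["Kort resumé genereret af modellen."]   # placeholder example
references = ["Kort guld-resumé fra datasættet."]    # placeholder example

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

rouge_scores = rouge.compute(predictions=summaries, references=references)
bert_scores = bertscore.compute(
    predictions=summaries, references=references, lang="da"  # Danish texts
)
print(rouge_scores)  # rouge1/rouge2/rougeL F-measures
print(sum(bert_scores["f1"]) / len(bert_scores["f1"]))  # mean BERTScore F1
```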