Marisa Hudspeth
2026
Contextual morphologically-guided tokenization for Latin encoder models
Marisa Hudspeth | Patrick J. Burns | Brendan O'Connor
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Tokenization is a critical component of language model pretraining, yet standard tokenization methods often prioritize information-theoretical goals like high compression and low fertility rather than linguistic goals like morphological alignment. In fact, they have been shown to be suboptimal for morphologically rich languages, where tokenization quality directly impacts downstream performance. In this work, we investigate morphologically-aware tokenization for Latin, a morphologically rich language that is medium-resource in terms of pretraining data, but high-resource in terms of curated lexical resources – a distinction that is often overlooked but critical in discussions of low-resource language modeling. We find that morphologically-guided tokenization improves overall performance on four downstream tasks. Performance gains are most pronounced for out-of-domain texts, highlighting our models’ improved generalization ability. Our findings demonstrate the utility of linguistic resources to improve language modeling for morphologically complex languages. For low-resource languages that lack large-scale pretraining data, the development and incorporation of linguistic resources can serve as a feasible alternative to improve LM performance.
2025
Automated main concept generation for narrative discourse assessment in aphasia
Ankita Gupta | Marisa Hudspeth | Polly Stokes | Jacquie Kurland | Brendan O’Connor
Findings of the Association for Computational Linguistics: ACL 2025
We present an interesting application of narrative understanding in the clinical assessment of aphasia, where story retelling tasks are used to evaluate a patient’s communication abilities. This clinical setting provides a framework to help operationalize narrative discourse analysis and an application-focused evaluation method for narrative understanding systems. In particular, we highlight the use of main concepts (MCs)—a list of statements that capture a story’s gist—for aphasic discourse analysis. We then propose automatically generating MCs from novel stories, which experts can edit manually, thus enabling wider adaptation of current assessment tools. We further develop a prompt ensemble method using large language models (LLMs) to automatically generate MCs for a novel story. We evaluate our method on an existing narrative summarization dataset to establish its intrinsic validity. We further apply it to a set of stories that have been annotated with MCs through extensive analysis of retells from non-aphasic and aphasic participants (Kurland et al., 2021, 2025). Our results show that our proposed method can generate most of the gold-standard MCs for stories from this dataset. Finally, we release this dataset of stories with annotated MCs to spur more research in this area.
2024
Latin Treebanks in Review: An Evaluation of Morphological Tagging Across Time
Marisa Hudspeth | Brendan O’Connor | Laure Thompson
Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024)
Existing Latin treebanks draw from Latin’s long written tradition, spanning 17 centuries and a variety of cultures. Recent efforts have begun to harmonize these treebanks’ annotations to better train and evaluate morphological taggers. However, the heterogeneity of these treebanks must be carefully considered to build effective and reliable data. In this work, we review existing Latin treebanks to identify the texts they draw from, identify their overlap, and document their coverage across time and genre. We additionally design automated conversions of their morphological feature annotations into the conventions of standard Latin grammar. From this, we build new time-period data splits that draw from the existing treebanks, which we use to perform a broad cross-time analysis for POS and morphological feature tagging. We find that BERT-based taggers outperform existing taggers while also being more robust to cross-domain shifts.