Craig Messner

2026

Pretraining Language Models for Diachronic Linguistic Change Discovery
Elisabeth Fittschen | Sabrina Xin Li | Tom Lippincott | Leshem Choshen | Craig Messner
Findings of the Association for Computational Linguistics: EACL 2026

Large language models (LLMs) are increasingly used as knowledge discovery tools. Humanistic disciplines like historical linguistics and literary studies have shown interest in this capability. These fields often construct arguments on the basis of distinctions between phenomena like time-period or genre. Such methodological investments complicate reliance on LLMs pretrained over large sets of broadly-collected data. We show that efficient pretraining techniques produce useful models of semantic change over modest historical corpora without allowing potential contamination from anachronistic data. We verify that these trained-from-scratch models better respect historical divisions and are more computationally efficient compared to the standard approach of fine-tuning an existing LLM. We compare the trade-offs in general linguistic fluency versus detecting and characterizing various forms of linguistic change, and provide a pipeline implementation of our approach that can be readily adapted and applied to a wide range of diachronic phenomena.

2025

pdf bib abs

Transferring Extreme Subword Style Using Ngram Model-Based Logit Scaling
Craig Messner | Tom Lippincott
Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities

We present an ngram model-based logit scaling technique that effectively transfers extreme subword stylistic variation to large language models at inference time. We demonstrate its efficacy by tracking the perplexity of generated text with respect to the ngram interpolated and original versions of an evaluation model. Minimizing the former measure while the latter approaches the perplexity of a text produced by a target author or character lets us select a sufficient degree of adaptation while retaining fluency.

2024

pdf bib abs

Pairing Orthographically Variant Literary Words to Standard Equivalents Using Neural Edit Distance Models
Craig Messner | Tom Lippincott
Proceedings of the 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2024)

We present a novel corpus consisting of orthographically variant words found in works of 19th century U.S. literature annotated with their corresponding “standard” word pair. We train a set of neural edit distance models to pair these variants with their standard forms, and compare the performance of these models to the performance of a set of neural edit distance models trained on a corpus of orthographic errors made by L2 English learners. Finally, we analyze the relative performance of these models in the light of different negative training sample generation strategies, and offer concluding remarks on the unique challenge literary orthographic variation poses to string pairing methodologies.

pdf bib abs

Examining Language Modeling Assumptions Using an Annotated Literary Dialect Corpus
Craig Messner | Tom Lippincott
Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities

We present a dataset of 19th century American literary orthovariant tokens with a novel layer of human-annotated dialect group tags designed to serve as the basis for computational experiments exploring literarily meaningful orthographic variation. We perform an initial broad set of experiments over this dataset using both token (BERT) and character (CANINE)-level contextual language models. We find indications that the “dialect effect” produced by intentional orthographic variation employs multiple linguistic channels, and that these channels are able to be surfaced to varied degrees given particular language modelling assumptions. Specifically, we find evidence showing that choice of tokenization scheme meaningfully impact the type of orthographic information a model is able to surface.

Co-authors

Venues

Fix author