Tom Lippincott
Also published as: Thomas Lippincott
2026
Pretraining Language Models for Diachronic Linguistic Change Discovery
Elisabeth Fittschen | Sabrina Xin Li | Tom Lippincott | Leshem Choshen | Craig Messner
Findings of the Association for Computational Linguistics: EACL 2026
Large language models (LLMs) are increasingly used as knowledge discovery tools. Humanistic disciplines like historical linguistics and literary studies have shown interest in this capability. These fields often construct arguments on the basis of distinctions between phenomena like time-period or genre. Such methodological investments complicate reliance on LLMs pretrained over large sets of broadly-collected data. We show that efficient pretraining techniques produce useful models of semantic change over modest historical corpora without allowing potential contamination from anachronistic data. We verify that these trained-from-scratch models better respect historical divisions and are more computationally efficient than the standard approach of fine-tuning an existing LLM. We compare the trade-offs between general linguistic fluency and the ability to detect and characterize various forms of linguistic change, and provide a pipeline implementation of our approach that can be readily adapted and applied to a wide range of diachronic phenomena.
2025
Automatic Language Identification in Texts
Tom Lippincott
Computational Linguistics, Volume 51, Issue 1 - March 2025
Computational Discovery of Chiasmus in Ancient Religious Text
Hope McGovern | Hale Sirin | Tom Lippincott
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)
Chiasmus, a debated literary device in Biblical texts, has captivated mystics while sparking ongoing scholarly discussion. In this paper, we introduce the first computational approach to systematically detect chiasmus within Biblical passages. Our method leverages neural embeddings to capture lexical and semantic patterns associated with chiasmus, applied at multiple levels of textual granularity (half-verses, verses). We also involve expert annotators to review a subset of the detected patterns. Despite its computational efficiency, our method achieves robust results, with high inter-annotator agreement and system accuracy of 0.80 at the verse level and 0.60 at the half-verse level. We further provide a qualitative analysis of the distribution of detected chiasmi, along with selected examples that highlight the effectiveness of our approach.
Characterizing the Effects of Translation on Intertextuality using Multilingual Embedding Spaces
Hope McGovern | Hale Sirin | Tom Lippincott
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)
Rhetorical devices are difficult to translate, but they are crucial to the translation of literary documents. We investigate the use of multilingual embedding spaces to characterize the preservation of intertextuality, one common rhetorical device, across human and machine translation. To do so, we use Biblical texts, which are both full of intertextual references and are highly translated works. We provide a metric to characterize intertextuality at the corpus level and provide a quantitative analysis of the preservation of this rhetorical device across extant human translations and machine-generated counterparts. We go on to provide qualitative analysis of cases wherein human translations over- or underemphasize the intertextuality present in the text, whereas machine translations provide a neutral baseline. This provides support for established scholarship proposing that human translators have a propensity to amplify certain literary characteristics of the original manuscripts.
Transferring Extreme Subword Style Using Ngram Model-Based Logit Scaling
Craig Messner | Tom Lippincott
Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities
We present an ngram model-based logit scaling technique that effectively transfers extreme subword stylistic variation to large language models at inference time. We demonstrate its efficacy by tracking the perplexity of generated text with respect to the ngram interpolated and original versions of an evaluation model. Minimizing the former measure while the latter approaches the perplexity of a text produced by a target author or character lets us select a sufficient degree of adaptation while retaining fluency.
2024
Detecting Structured Language Alternations in Historical Documents by Combining Language Identification with Fourier Analysis
Hale Sirin | Sabrina Li | Tom Lippincott
Proceedings of the 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2024)
In this study, we present a generalizable workflow to identify documents written in a nonstandard language and script combination, Armeno-Turkish (Turkish written in Armenian script). We introduce the task of detecting distinct patterns of multilinguality based on the frequency of structured language alternations within a document.
Dynamic embedded topic models and change-point detection for exploring literary-historical hypotheses
Hale Sirin | Tom Lippincott
Proceedings of the 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2024)
We present a novel combination of dynamic embedded topic models and change-point detection to explore diachronic change of lexical semantic modality in classical and early Christian Latin. We demonstrate several methods for finding and characterizing patterns in the output, and relating them to traditional scholarship in Comparative Literature and Classics. This simple approach to unsupervised models of semantic change can be applied to any suitable corpus, and we conclude with future directions and refinements aiming to allow noisier, less-curated materials to meet that threshold.
Pairing Orthographically Variant Literary Words to Standard Equivalents Using Neural Edit Distance Models
Craig Messner | Tom Lippincott
Proceedings of the 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2024)
We present a novel corpus consisting of orthographically variant words found in works of 19th century U.S. literature annotated with their corresponding “standard” word pair. We train a set of neural edit distance models to pair these variants with their standard forms, and compare the performance of these models to the performance of a set of neural edit distance models trained on a corpus of orthographic errors made by L2 English learners. Finally, we analyze the relative performance of these models in the light of different negative training sample generation strategies, and offer concluding remarks on the unique challenge literary orthographic variation poses to string pairing methodologies.
Detecting Narrative Patterns in Biblical Hebrew and Greek
Hope McGovern | Hale Sirin | Tom Lippincott | Andrew Caines
Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024)
We present a novel approach to extracting recurring narrative patterns, or type-scenes, in Biblical Hebrew and Biblical Greek with an information retrieval network. We use cross-references to train an encoder model to create similar representations for verses linked by a cross-reference. We then query our trained model with phrases informed by humanities scholarship and designed to elicit particular kinds of narrative scenes. Our models can surface relevant instances in the top-10 ranked candidates in many cases. Through manual error analysis and discussion, we address the limitations and challenges inherent in our approach. Our findings contribute to the field of Biblical scholarship by offering a new perspective on narrative analysis within ancient texts, and to computational modeling of narrative with a genre-agnostic approach for pattern-finding in long, literary texts.
Examining Language Modeling Assumptions Using an Annotated Literary Dialect Corpus
Craig Messner | Tom Lippincott
Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities
We present a dataset of 19th century American literary orthovariant tokens with a novel layer of human-annotated dialect group tags designed to serve as the basis for computational experiments exploring literarily meaningful orthographic variation. We perform an initial broad set of experiments over this dataset using both token-level (BERT) and character-level (CANINE) contextual language models. We find indications that the “dialect effect” produced by intentional orthographic variation employs multiple linguistic channels, and that these channels can be surfaced to varying degrees under particular language modeling assumptions. Specifically, we find evidence showing that the choice of tokenization scheme meaningfully impacts the type of orthographic information a model is able to surface.
2021
Active learning and negative evidence for language identification
Thomas Lippincott | Ben Van Durme
Proceedings of the Second Workshop on Data Science with Human in the Loop: Language Advances
Language identification (LID), the task of determining the natural language of a given text, is an essential first step in most NLP pipelines. While generally a solved problem for documents of sufficient length and languages with ample training data, the proliferation of microblogs and other social media has made it increasingly common to encounter use-cases that *don’t* satisfy these conditions. In these situations, the fundamental difficulty is the lack of, and cost of gathering, labeled data: unlike some annotation tasks, no single “expert” can quickly and reliably identify more than a handful of languages. This leads to a natural question: can we gain useful information when annotators are only able to *rule out* languages for a given document, rather than supply a positive label? What are the optimal choices for gathering and representing such *negative evidence* as a model is trained? In this paper, we demonstrate that using negative evidence can improve the performance of a simple neural LID model. This improvement is sensitive to policies of how the evidence is represented in the loss function, and for deciding which annotators to employ given the instance and model state. We consider simple policies and report experimental results that indicate the optimal choices for this task. We conclude with a discussion of future work to determine if and how the results generalize to other classification tasks.
Co-authors
- Hale Sirin 5
- Craig Messner 4
- Hope McGovern 3
- Annabelle Carrell 2
- Kevin Duh 2
- Benjamin Van Durme 2
- Anna Korhonen 2
- Diarmuid Ó Séaghdha 2
- Deana Burchfield 1
- Andrew Caines 1
- Julianne Chaloux 1
- Tongfei Chen 1
- Leshem Choshen 1
- Alex Comerford 1
- Cash Costello 1
- Mark Dredze 1
- Tim Finin 1
- Elisabeth Fittschen 1
- Benjamin Glass 1
- Nizar Habash 1
- Shudong Hao 1
- Craig Harman 1
- Judith L. Klavans 1
- Philipp Koehn 1
- Dawn Lawrie 1
- Sabrina Li 1
- Sabrina Xin Li 1
- M. Patrick Martin 1
- Chandler May 1
- James Mayfield 1
- Paul McNamee 1
- Scott Miller 1
- Rebecca J. Passonneau 1
- Adam Poliak 1
- Owen Rambow 1
- Mohammad Sadegh Rasooli 1
- Pushpendre Rastogi 1
- Rashmi Sankepally 1
- Pamela Shapiro 1
- Lin Sun 1
- Max Thomas 1
- Ying-Ying Tran 1
- Ben Van Durme 1
- Travis Wolfe 1
- Tae Yano 1
- Ted Zhang 1