Lauren Fonteyn


2022

Non-Parametric Word Sense Disambiguation for Historical Languages
Enrique Manjavacas Arevalo | Lauren Fonteyn
Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities

Recent approaches to Word Sense Disambiguation (WSD) have profited from the enhanced contextualized word representations coming from contemporary Large Language Models (LLMs). This advancement is accompanied by a renewed interest in WSD applications in Humanities research, where the lack of suitable, specific WSD-annotated resources is a hurdle in developing ad-hoc WSD systems. Because they can exploit sentential context, LLMs are particularly suited for disambiguation tasks. Still, the application of LLMs is often limited to linear classifiers trained on top of the LLM architecture. In this paper, we follow recent developments in non-parametric learning and show how LLMs can be efficiently fine-tuned to achieve strong few-shot performance on WSD for historical languages (English and Dutch, date range: 1450-1950). We test our hypothesis using (i) a large, general evaluation set taken from large lexical databases, and (ii) a small real-world scenario involving an ad-hoc WSD task. Moreover, this paper marks the release of GysBERT, an LLM for historical Dutch.
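The non-parametric setup described in the abstract can be illustrated with a small, hedged sketch: contextualized embeddings of a target word are compared against a handful of labelled support examples by nearest-neighbour lookup, rather than through a trained linear classifier. This is illustrative only, not the authors' released code; the Hugging Face model ID "emanjavacas/MacBERTh", the toy sentences, and the character offsets are assumptions.

import torch
from transformers import AutoTokenizer, AutoModel

# Assumption: "emanjavacas/MacBERTh" is the hub ID of the historical English model.
MODEL_ID = "emanjavacas/MacBERTh"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID).eval()

def target_embedding(sentence, start, end):
    # Mean-pool the subword embeddings covering the character span [start, end).
    enc = tokenizer(sentence, return_offsets_mapping=True, return_tensors="pt")
    offsets = enc.pop("offset_mapping")[0]
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, hidden_dim)
    keep = torch.tensor([s < end and e > start and e > s for s, e in offsets.tolist()])
    return hidden[keep].mean(dim=0)

# Hypothetical toy support set: a few labelled occurrences per sense.
support = [
    ("institution", "He deposited the coins at the bank in town.", 30, 34),
    ("riverside",   "They moored the boat along the muddy bank.", 37, 41),
]
support_vecs = [(label, target_embedding(s, a, b)) for label, s, a, b in support]

# Disambiguate a new occurrence by cosine similarity to the nearest support example.
query = target_embedding("She walked along the bank of the river.", 21, 25)
label, _ = max(support_vecs,
               key=lambda lv: torch.cosine_similarity(query, lv[1], dim=0).item())
print("predicted sense:", label)

In the few-shot regime studied in the paper, the support set would contain only a handful of annotated occurrences per sense, which is exactly the situation where such nearest-neighbour lookups over a fine-tuned encoder are attractive.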

2021

MacBERTh: Development and Evaluation of a Historically Pre-trained Language Model for English (1450-1950)
Enrique Manjavacas Arevalo | Lauren Fonteyn
Proceedings of the Workshop on Natural Language Processing for Digital Humanities

The new pre-train-then-fine-tune paradigm in Natural Language Processing made important performance gains accessible to a wider audience. Once pre-trained, deploying a large language model presents comparatively small infrastructure requirements and offers robust performance in many NLP tasks. The Digital Humanities community has been an early adopter of this paradigm. Yet, a large part of this community is concerned with the application of NLP algorithms to historical texts, for which large models pre-trained on contemporary text may not provide optimal results. In the present paper, we present “MacBERTh”—a transformer-based language model pre-trained on historical English—and exhaustively assess its benefits on a large set of relevant downstream tasks. Our experiments highlight that, despite some differences across target time periods, pre-training on historical language from scratch outperforms models pre-trained on present-day language and later adapted to historical language.
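As a minimal illustration of how such a historically pre-trained model might be queried once released, the hedged sketch below runs a fill-mask probe over an archaic-style sentence. The hub ID "emanjavacas/MacBERTh", the BERT-style [MASK] token, and the example sentence are assumptions; substitute the identifiers from the actual model release if they differ.

from transformers import pipeline

# Assumption: the model is published under "emanjavacas/MacBERTh" and uses a BERT-style mask token.
fill_mask = pipeline("fill-mask", model="emanjavacas/MacBERTh")

# Early Modern English spelling; a model pre-trained only on present-day text
# may rank these completions quite differently.
for pred in fill_mask("Thou art the [MASK] of my heart."):
    print(f"{pred['token_str']!r}  score={pred['score']:.3f}")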