Sidsel Boldsen


2021

Survey and reproduction of computational approaches to dating of historical texts
Sidsel Boldsen | Fredrik Wahlberg
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

Finding the year of writing for a historical text is of crucial importance to historical research. However, the year of original creation is rarely explicitly stated and must be inferred from the text content, historical records, and codicological clues. Given a transcribed text, machine learning has successfully been used to estimate the year of production. In this paper, we present an overview of several estimation approaches for historical text archives spanning from the 12th century until today.

2019

Identifying Temporal Trends Based on Perplexity and Clustering: Are We Looking at Language Change?
Sidsel Boldsen | Manex Agirrezabal | Patrizia Paggio
Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change

In this work, we propose a data-driven methodology for identifying temporal trends in a corpus of medieval charters. We use perplexities derived from RNNs as a distance measure between documents and then perform clustering on those distances. We argue that perplexities calculated by such language models are representative of temporal trends. The clusters produced by the K-Means algorithm give insight into the differences in language across time periods, which are at least partly due to language change. We suggest that the temporal distribution of the individual clusters might provide a more nuanced picture of temporal trends than discrete bins, thus providing better results when used in a classification task.
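
The general idea of clustering documents by cross-document perplexity can be illustrated with a minimal sketch. This is not the authors' implementation: it substitutes an add-one-smoothed character bigram model for the RNN language models, uses a small hand-written K-Means with a deterministic farthest-first initialization, and runs on four made-up toy strings. All function names (`train_bigram`, `perplexity`, `kmeans`) are hypothetical.

```python
import math
from collections import Counter

def train_bigram(text):
    """Count character bigrams and their left contexts in one document."""
    return Counter(zip(text, text[1:])), Counter(text[:-1])

def perplexity(model, text, vocab_size):
    """Perplexity of `text` under an add-one-smoothed character bigram model."""
    bigrams, contexts = model
    pairs = list(zip(text, text[1:]))
    log_prob = sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
        for a, b in pairs
    )
    return math.exp(-log_prob / len(pairs))

def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=10):
    """Plain K-Means; centroids start at the first point, then farthest-first."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(farthest))
    labels = []
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy "documents" (assumption: stand-ins for charters from two periods).
docs = ["abababababab", "babababababa", "cdcdcdcdcdcd", "dcdcdcdcdcdc"]
vocab_size = len(set("".join(docs)))
models = [train_bigram(d) for d in docs]

# matrix[i][j] = perplexity of document j under the model trained on document i.
matrix = [[perplexity(m, d, vocab_size) for d in docs] for m in models]

# Represent each document by its perplexities under every model, then cluster.
features = [[matrix[i][j] for i in range(len(docs))] for j in range(len(docs))]
labels = kmeans(features, k=2)
```

In this toy setting, a document receives low perplexity from models trained on similar text and high perplexity from the others, so the perplexity profile acts as a position in a distance space that K-Means can partition.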

The Seemingly (Un)systematic Linking Element in Danish
Sidsel Boldsen | Manex Agirrezabal
Proceedings of the 22nd Nordic Conference on Computational Linguistics

The use of a linking element between compound members is a common phenomenon in Germanic languages. Still, the exact use and conditioning of such elements is a disputed topic in linguistics. In this paper, we address the issue of predicting the use of linking elements in Danish. Following previous research that shows how the choice of linking element might be conditioned by phonology, we frame the problem as a language modeling task: considering the linking elements -s/-∅, the problem becomes predicting what is most probable to encounter next, a syllable boundary or the joining element ‘s’. We show that training a language model on this task reaches an accuracy of 94%, and in the case of an unsupervised model, the accuracy reaches 80%.