Aatu Liimatta


2023

Measuring the distribution of Hume’s Scotticisms in the ECCO collection
Iiro Tiihonen | Aatu Liimatta | Lidia Pivovarova | Tanja Säily | Mikko Tolonen
Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages

This short paper studies the distribution of Scotticisms from a list compiled by David Hume in a large collection of 18th-century publications. We use regular expression search to find the items on the list in the ECCO collection, and then apply regression analysis to test whether the distribution of Scotticisms in works first published in Scotland is significantly different from the distribution of Scotticisms in works first published in England. We further refine our analysis to trace the influence of variables such as publication date, genre and author’s country of origin.

Effect of data quality on the automated identification of register features in Eighteenth Century Collections Online
Aatu Liimatta
Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages

Many large-scale investigations of textual data are based on the automated identification of various linguistic features. However, if the textual data is of lower quality, automated identification of linguistic features, particularly more complex ones, can be severely hampered. Data quality problems are particularly prominent with large datasets of historical text which have been made machine-readable using optical character recognition (OCR) technology, but it is unclear how much the identification of individual linguistic features is affected by the dirty OCR, and how features of varying complexity are influenced differently. In this paper, I analyze the effect of OCR quality on the automated identification of the set of linguistic features commonly used for multi-dimensional register analysis (MDA) by comparing their observed frequencies in the OCR-processed Eighteenth Century Collections Online (ECCO) and a clean baseline (ECCO-TCP). The results show that the identification of most features is disturbed more as the OCR quality decreases, but different features start degrading at different OCR quality levels and do so at different rates.