2023
Detection and attribution of quotes in Finnish news media: BERT vs. rule-based approach
Maciej Janicki | Antti Kanner | Eetu Mäkelä
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
We approach the problem of recognition and attribution of quotes in Finnish news media. Solving this task would enable large-scale analysis of media with respect to the presence and presentation styles of different voices and opinions. We describe the annotation of a corpus of media texts, numbering around 1500 articles, with quote attribution and coreference information. Further, we compare two methods for automatic quote recognition: a rule-based one operating on dependency trees and a machine learning one built on top of the BERT language model. We conclude that BERT provides more promising results even with little training data, achieving a 95% F-score on direct quote recognition and 84% on indirect quotes. Finally, we discuss open problems and further associated tasks, especially the necessity of resolving speaker mentions to entity references.
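The rule-based approach in the abstract operates on dependency trees; as a heavily simplified, pattern-based illustration of what direct-quote recognition and speaker attribution involve (not the paper's actual method; the regex and the speech-verb list are assumptions for illustration), one could write:

```python
import re

# Toy list of attribution verbs; the paper's rules are far richer (assumption).
SPEECH_VERBS = r"(?:said|says|stated|told)"

# Narrow pattern: a direct quote followed by "said <Speaker>".
QUOTE_RE = re.compile(
    r'"(?P<quote>[^"]+)"\s*,?\s*' + SPEECH_VERBS + r'\s+(?P<speaker>[A-Z]\w+)'
)

def extract_quotes(text):
    """Return (quote, speaker) pairs for direct quotes with trailing attribution."""
    return [(m.group("quote"), m.group("speaker")) for m in QUOTE_RE.finditer(text)]

print(extract_quotes('"The results are promising", said Janicki.'))
```

A real system must also handle indirect quotes and coreferent speaker mentions, which is exactly where the dependency-tree and BERT approaches compared in the paper come in.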
2022
Explainable Publication Year Prediction of Eighteenth Century Texts with the BERT Model
Iiro Rastas | Yann Ciarán Ryan | Iiro Tiihonen | Mohammadreza Qaraei | Liina Repo | Rohit Babbar | Eetu Mäkelä | Mikko Tolonen | Filip Ginter
Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change
In this paper, we describe a BERT model trained on the Eighteenth Century Collections Online (ECCO) dataset of digitized documents. The ECCO dataset poses unique modelling challenges due to the presence of Optical Character Recognition (OCR) artifacts. We establish the performance of the BERT model on a publication year prediction task against linear baseline models and human judgement, finding the BERT model to be superior to both and able to date the works, on average, with less than 7 years absolute error. We also explore how language change over time affects the model by analyzing the features the model uses for publication year predictions as given by the Integrated Gradients model explanation method.
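For intuition about the Integrated Gradients method mentioned above: for a linear model the path integral has a closed form, the attribution of feature i is (x_i − baseline_i)·w_i, and the attributions sum exactly to f(x) − f(baseline) (the completeness axiom). A minimal sketch with toy weights (not the paper's BERT setup):

```python
def integrated_gradients_linear(weights, x, baseline):
    """IG for f(x) = sum(w_i * x_i): the gradient is constant along the
    straight-line path, so the integral collapses to (x_i - b_i) * w_i."""
    return [(xi - bi) * wi for wi, xi, bi in zip(weights, x, baseline)]

w = [0.5, -1.0, 2.0]   # toy weights (assumption)
x = [1.0, 2.0, 3.0]    # input
x0 = [0.0, 0.0, 0.0]   # all-zeros baseline

attr = integrated_gradients_linear(w, x, x0)
f = lambda v: sum(wi * vi for wi, vi in zip(w, v))
# Completeness: attributions sum to f(x) - f(baseline)
assert abs(sum(attr) - (f(x) - f(x0))) < 1e-9
print(attr)  # [0.5, -2.0, 6.0]
```

For a deep model like BERT the gradient is not constant, so the integral is approximated numerically along the path, but the completeness property checked above still holds.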
2019
Revisiting NMT for Normalization of Early English Letters
Mika Hämäläinen | Tanja Säily | Jack Rueter | Jörg Tiedemann | Eetu Mäkelä
Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature
This paper studies the use of NMT (neural machine translation) as a normalization method for an early English letter corpus. The corpus has previously been partially normalized, leaving only the less frequent deviant forms unnormalized. This paper explores several approaches to improving the normalization of these remaining forms. Adding features to the training data is found to be unhelpful, but using a lexicographical resource to filter the top candidates produced by the NMT model, together with lemmatization, improves results.
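The candidate-filtering step described above can be sketched as follows: keep the highest-ranked NMT candidate that is attested in a lexicographical resource, falling back to the top candidate when none is (function name and lexicon contents are illustrative assumptions):

```python
def pick_normalization(candidates, lexicon):
    """Return the first (highest-scoring) candidate found in the lexicon;
    fall back to the top-ranked candidate if none is attested."""
    for cand in candidates:
        if cand in lexicon:
            return cand
    return candidates[0]

lexicon = {"love", "loud", "low"}          # stand-in for a dictionary resource
nmt_candidates = ["lou", "love", "lave"]   # ranked NMT output (toy example)
print(pick_normalization(nmt_candidates, lexicon))  # love
```

The idea is that the NMT model proposes plausible character-level rewrites, while the lexicon vetoes candidates that are not real present-day words.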
2018
Normalizing Early English Letters to Present-day English Spelling
Mika Hämäläinen | Tanja Säily | Jack Rueter | Jörg Tiedemann | Eetu Mäkelä
Proceedings of the Second Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature
This paper presents multiple methods for normalizing the most deviant and infrequent historical spellings in a corpus consisting of personal correspondence from the 15th to the 19th century. The methods include machine translation (neural and statistical), edit distance and rule-based FSTs. The different normalization methods are compared and evaluated. Each method has its own strengths in word normalization, which calls for combining their results to leverage those individual strengths.
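The edit-distance method among those listed can be illustrated by mapping a historical spelling to the modern-lexicon word with the smallest Levenshtein distance (the lexicon here is a toy stand-in, not the paper's resource):

```python
def levenshtein(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def normalize(word, lexicon):
    """Map a historical spelling to the nearest present-day form."""
    return min(lexicon, key=lambda w: levenshtein(word, w))

lexicon = ["love", "leave", "line"]
print(normalize("loue", lexicon))  # love
```

In practice such a baseline is combined with the MT and FST outputs, since plain edit distance cannot model systematic spelling changes (e.g. u/v alternation) any better than arbitrary ones.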