Krzysztof Jurkiewicz
2022
Challenging America: Modeling language in longer time scales
Jakub Pokrywka | Filip Graliński | Krzysztof Jassem | Karol Kaczmarek | Krzysztof Jurkiewicz | Piotr Wierzchon
Findings of the Association for Computational Linguistics: NAACL 2022
The aim of the paper is to apply to historical texts the methodology commonly used to solve various NLP tasks defined for contemporary data, i.e. pre-training and fine-tuning large Transformer models. This paper introduces an ML challenge, named Challenging America (ChallAm), based on OCR-ed excerpts from historical newspapers collected from the Chronicling America portal. ChallAm provides a dataset of clippings, labeled with metadata on their origin and paired with their textual contents retrieved by an OCR tool. Three publicly available ML tasks are defined in the challenge: to determine the article date, to detect the location of the issue, and to deduce a word in a text gap (cloze test). Strong baselines are provided for all three ChallAm tasks. In particular, we pre-trained a RoBERTa model from scratch on the historical texts. We also discuss the issues of discrimination and hate speech present in the historical American texts.