Text Preprocessing and its Implications in a Digital Humanities Project

Maria Kunilovskaya, Alistair Plum


Abstract
This paper focuses on data cleaning as part of a preprocessing procedure applied to text data retrieved from the web. Although researchers often highlight the importance of this early stage in a project using NLP methods, the details, general principles and techniques are usually left out for reasons of space. At best, they are dismissed with a comment such as “The usual data cleaning and preprocessing procedures were applied”. More coverage is usually given to automatic text annotation such as lemmatisation, part-of-speech tagging and parsing, which is often included in preprocessing. In the literature, the term ‘preprocessing’ refers to a wide range of procedures, from filtering and cleaning to data transformation such as stemming and numeric representation, which might create confusion. We argue that text preprocessing might skew the original data distribution with regard to the metadata, such as types, locations and times of registered datapoints. In this paper we describe a systematic approach to cleaning text data mined by a data-providing company for a Digital Humanities (DH) project focused on cultural analytics. We reveal the types and amount of noise in the data coming from various web sources and estimate the changes in the size of the data associated with preprocessing. We also compare the results of a text classification experiment run on the raw and preprocessed data. We hope that our experience and approaches will help the DH community to diagnose the quality of textual data collected from the web and prepare it for further natural language processing.
Anthology ID:
2021.ranlp-srw.13
Volume:
Proceedings of the Student Research Workshop Associated with RANLP 2021
Month:
September
Year:
2021
Address:
Online
Editors:
Souhila Djabri, Dinara Gimadi, Tsvetomila Mihaylova, Ivelina Nikolova-Koleva
Venue:
RANLP
Publisher:
INCOMA Ltd.
Pages:
85–93
URL:
https://aclanthology.org/2021.ranlp-srw.13
Cite (ACL):
Maria Kunilovskaya and Alistair Plum. 2021. Text Preprocessing and its Implications in a Digital Humanities Project. In Proceedings of the Student Research Workshop Associated with RANLP 2021, pages 85–93, Online. INCOMA Ltd.
Cite (Informal):
Text Preprocessing and its Implications in a Digital Humanities Project (Kunilovskaya & Plum, RANLP 2021)
PDF:
https://preview.aclanthology.org/emnlp-22-attachments/2021.ranlp-srw.13.pdf