Khetam Al Sharou


2021

pdf bib
Towards a Better Understanding of Noise in Natural Language Processing
Khetam Al Sharou | Zhenhao Li | Lucia Specia
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

In this paper, we propose a definition and taxonomy of various types of non-standard textual content – generally referred to as “noise” – in Natural Language Processing (NLP). While data pre-processing is undoubtedly important in NLP, especially when dealing with user-generated content, a broader understanding of different sources of noise and how to deal with them is an aspect that has been largely neglected. We provide a comprehensive list of potential sources of noise, categorise and describe them, and show the impact of a subset of standard pre-processing strategies on different tasks. Our main goal is to raise awareness of non-standard content – which should not always be considered as “noise” – and of the need for careful, task-dependent pre-processing. This is an alternative to blanket, all-encompassing solutions generally applied by researchers through “standard” pre-processing pipelines. The intention is for this categorisation to serve as a point of reference to support NLP researchers in devising strategies to clean, normalise or embrace non-standard content.