Luis Meneses-Lerín
2020
Automatic Detection of Offensive Language in Social Media: Defining Linguistic Criteria to build a Mexican Spanish Dataset
María José Díaz-Torres
|
Paulina Alejandra Morán-Méndez
|
Luis Villasenor-Pineda
|
Manuel Montes-y-Gómez
|
Juan Aguilera
|
Luis Meneses-Lerín
Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying
Phenomena such as bullying, homophobia, sexism and racism have transcended to social networks, motivating the development of tools for their automatic detection. The challenge becomes greater for languages rich in popular sayings, colloquial expressions and idioms which may contain vulgar, profane or rude words, but not always have the intention of offending, as is the case of Mexican Spanish. Under these circumstances, the identification of the offense goes beyond the lexical and syntactic elements of the message. This first work aims to define the main linguistic features of aggressive, offensive and vulgar language in social networks in order to establish linguistic-based criteria to facilitate the identification of abusive language. For this purpose, a Mexican Spanish Twitter corpus was compiled and analyzed. The dataset included words that, despite being rude, need to be considered in context to determine they are part of an offense. Based on the analysis of this corpus, linguistic criteria were defined to determine whether a message is offensive. To simplify the application of these criteria, an easy-to-follow diagram was designed. The paper presents an example of the use of the diagram, as well as the basic statistics of the corpus.