Anatomy of Data Repositories for the Analysis and Detection of Toxicity in Portuguese

Lorena Souza Moreira; Paula Teresa M. Gibrim; Leonardo Rocha; Julio C. S. Reis

Abstract

The proliferation of online hate speech requires a rigorous examination of the datasets used to train detection models. In this work, we analyze six Brazilian Portuguese datasets annotated for hate speech or toxicity to investigate how their lexical "anatomy" and domain characteristics affect cross-domain generalization. We combine HurtLex-based lexical profiling with cross-dataset evaluation in a feature-based transfer-learning setup, using BERTimbau embeddings and an XGBoost classifier. Our analysis shows that, although the datasets share a broadly similar macro-level focus, they diverge substantially in how specific terms are used and labeled across platforms and topics. Results indicate that lexical breadth and annotation practices strongly predict transferability: datasets with broader and more heterogeneous toxic vocabulary yield better cross-domain performance, whereas resources with narrow, profanity-centered labeling lead to severe generalization gaps, even when lexical overlap is high. These findings underscore the impact of collection and labeling strategies on the curation and evaluation of Portuguese hate speech datasets. Warning! This work and the referenced datasets contain examples of offensive and hateful language.
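As a rough illustration of the lexical-profiling idea mentioned in the abstract, the sketch below collects the lexicon terms that occur in each dataset and compares datasets pairwise. It is not the authors' pipeline: the lexicon entries and texts are hypothetical placeholders standing in for HurtLex and the six corpora, and Jaccard similarity is an assumed overlap measure chosen only for simplicity.

```python
import re
from itertools import combinations

# Hypothetical stand-in for a HurtLex-style lexicon (placeholder tokens, not real entries).
LEXICON = {"termo_a", "termo_b", "termo_c", "termo_d"}

# Toy corpora standing in for the annotated datasets (assumed data, for illustration only).
DATASETS = {
    "dataset_1": ["texto com termo_a e termo_b", "outro texto termo_c"],
    "dataset_2": ["mensagem termo_b", "mensagem sem nada"],
    "dataset_3": ["termo_d aparece aqui", "termo_a tambem"],
}


def lexicon_profile(texts, lexicon):
    """Return the set of lexicon terms that occur in any of the texts."""
    tokens = set()
    for text in texts:
        tokens.update(re.findall(r"\w+", text.lower()))
    return tokens & lexicon


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# One lexical profile per dataset, then pairwise overlap between profiles.
profiles = {name: lexicon_profile(texts, LEXICON) for name, texts in DATASETS.items()}
overlaps = {
    (a, b): jaccard(profiles[a], profiles[b])
    for a, b in combinations(sorted(profiles), 2)
}

for pair, score in overlaps.items():
    print(pair, round(score, 3))
```

Overlap computed this way only says which terms two datasets share, not how those terms are labeled. That limitation is consistent with the abstract's observation that generalization gaps can persist even when lexical overlap is high.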

Anthology ID: 2026.propor-1.45
Volume: Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1
Month: April
Year: 2026
Address: Salvador, Brazil
Editors: Marlo Souza, Iria de-Dios-Flores, Diana Santos, Larissa Freitas, Jackson Wilke da Cruz Souza, Eugénio Ribeiro
Venue: PROPOR
Publisher: Association for Computational Linguistics
Pages: 456–466
URL: https://preview.aclanthology.org/ingest-dnd/2026.propor-1.45/
Cite (ACL): Lorena Souza Moreira, Paula Teresa M. Gibrim, Leonardo Rocha, and Julio C. S. Reis. 2026. Anatomy of Data Repositories for the Analysis and Detection of Toxicity in Portuguese. In Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1, pages 456–466, Salvador, Brazil. Association for Computational Linguistics.
Cite (Informal): Anatomy of Data Repositories for the Analysis and Detection of Toxicity in Portuguese (Moreira et al., PROPOR 2026)
PDF: https://preview.aclanthology.org/ingest-dnd/2026.propor-1.45.pdf
