@inproceedings{carvalho-etal-2026-unsupervised,
    title = "Unsupervised Evaluation of Explanations for Hate Speech Classification in {Portuguese}",
    author = "Carvalho, Isabel and
      Oliveira, Hugo Gon{\c{c}}alo and
      Silva, Catarina",
    editor = "Souza, Marlo and
      de-Dios-Flores, Iria and
      Santos, Diana and
      Freitas, Larissa and
      Souza, Jackson Wilke da Cruz and
      Ribeiro, Eug{\'e}nio",
    booktitle = "Proceedings of the 17th International Conference on Computational Processing of {Portuguese} ({PROPOR} 2026) - Vol. 1",
    month = apr,
    year = "2026",
    address = "Salvador, Brazil",
    publisher = "Association for Computational Linguistics",
    url = "https://preview.aclanthology.org/ingest-dnd/2026.propor-1.77/",
    pages = "780--789",
    isbn = "979-8-89176-387-6",
    abstract = "Top-performing Artificial Intelligence models often operate as black boxes. Explainable AI (XAI) can increase transparency, but its evaluation is currently hindered by a lack of annotated explanation data and agreed-upon validation standards. We propose a framework for evaluating the faithfulness of explanations in Portuguese hate speech detection. Our approach is based on the premise that a faithful explanation should identify features whose removal degrades a model{'}s performance. We follow a three-step process: (i) prediction on the original input; (ii) identification and removal of explanatory keywords; and (iii), prediction on the modified input, with performance differences used as an evaluation signal. We conduct experiments using ensemble classifiers, multiple keyword selection strategies, and SHAP and LIME as XAI methods. In addition, Large Language Models (LLMs) are explored both as classifiers and as explainers. Results demonstrate that removing explanatory keywords degrades model performance more than random word removal, indicating explanation faithfulness. Notably, SHAP and LIME consistently provided more faithful explanations than LLM-generated or manual alternatives, although impact depends on the keyword selection strategy. These findings highlight the importance of standardised, unsupervised evaluation protocols for XAI and the faithfulness limitations of current generative LLM explanations."
}
@comment{Informal Markdown citation from the ACL Anthology export page, kept here for reference only:
Markdown (Informal)
[Unsupervised Evaluation of Explanations for Hate Speech Classification in Portuguese](https://preview.aclanthology.org/ingest-dnd/2026.propor-1.77/) (Carvalho et al., PROPOR 2026)
ACL}