Leonardo Chaves Dutra Da Rocha

Also published as: Leonardo Chaves Dutra da Rocha

2026

When High Accuracy Hides Poor Calibration: Rethinking Confidence Evaluation in Transformer-Based Text Classification with Balanced Brier Score
Guilherme Fonseca | Gabriel Prenassi | Washington Cunha | Leonardo Chaves Dutra da Rocha | Marcos Andr\'e Gon\c{c}alves
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Transformer-based Small (SLMs) and Large Language Models (LLMs) achieve strong effectiveness in text classification (TC), yet deployment requires reliable confidence estimates. Although miscalibration in Transformers has been reported, evidence for TC under fine-tuning remains limited. We evaluate the calibration of fine-tuned SLMs and LLMs against Logistic Regression, a classical, well-calibrated baseline, and find that, despite superior effectiveness, Transformers remain markedly overconfident. Crucially, we show that widely used calibration metrics, such as Expected Calibration Error and Brier Score, become biased in high-effectiveness regimes, where the dominance of correct predictions masks severe miscalibration on errors, sometimes even suggesting better calibration than Logistic Regression, a well-known calibrated method. To address this limitation, we propose the Balanced Brier Score (BBS), which balances the contribution of correct and incorrect predictions within confidence bins. BBS reveals substantially poorer calibration in both SLMs and LLMs, consistent with qualitative evidence from calibration curves. These findings challenge current calibration assessment practices and provide a more reliable alternative for evaluating confidence quality in Transformer-based TC.

2025

pdf bib abs

Instance-Selection-Inspired Undersampling Strategies for Bias Reduction in Small and Large Language Models for Binary Text Classification
Guilherme Fonseca | Washington Cunha | Gabriel Prenassi | Marcos André Gonçalves | Leonardo Chaves Dutra Da Rocha
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Skewness in imbalanced datasets affects Automatic Text Classification (ATC), leading to classifier bias toward the majority classes. This work examines undersampling methods to mitigate such bias in Small and Large Language Model (SLMs and LLMs) classifiers. Based on the limitations found in existing solutions, we propose two novel undersampling methods inspired by state-of-the-art Instance Selection techniques, relying on calibrated confidences and semantic difficulty estimates. We compare them against 19 baselines across 13 datasets, evaluating: (i) effectiveness, (ii) class imbalance bias, (iii) efficiency, (iv) scalability, and (v) consistency. Results show our methods uniquely reduce classifier bias (up to 56%) across all datasets without effectiveness loss while improving efficiency (1.6x speedup), scalability and reducing carbon emissions (up to 50%).

2024

pdf bib abs

Explaining the Hardest Errors of Contextual Embedding Based Classifiers
Claudio Moisés Valiense De Andrade | Washington Cunha | Guilherme Fonseca | Ana Clara Souza Pagano | Luana De Castro Santos | Adriana Silvina Pagano | Leonardo Chaves Dutra Da Rocha | Marcos André Gonçalves
Proceedings of the 28th Conference on Computational Natural Language Learning

We seek to explain the causes of the misclassification of the most challenging documents, namely those that no classifier using state-of-the-art, very semantically-separable contextual embedding representations managed to predict accurately. To do so, we propose a taxonomy of incorrect predictions, which we used to perform qualitative human evaluation. We posed two (research) questions, considering three sentiment datasets in two different domains – movie and product reviews. Evaluators with two different backgrounds evaluated documents by comparing the predominant sentiment assigned by the model to the label in the gold dataset in order to decide on a likely misclassification reason. Based on a high inter-evaluator agreement (81.7%), we observed significant differences between the product and movie review domains, such as the prevalence of ambivalence in product reviews and sarcasm in movie reviews. Our analysis also revealed an unexpectedly high rate of incorrect labeling in the gold dataset (up to 33%) and a significant amount of incorrect prediction by the model due to a series of linguistic phenomena (including amplified words, contrastive markers, comparative sentences, and references to world knowledge). Overall, our taxonomy and methodology allow us to explain between 80%-85% of the errors with high confidence (agreement) – enabling us to point out where future efforts to improve models should be concentrated.

Co-authors

Marcos Andr\'e Gon\c{c}alves 1

Ana Clara Souza Pagano 1

Adriana Silvina Pagano 1

Luana De Castro Santos 1

Venues

ACL2
CoNLL1

Fix author