Twenty Years of HAREM: A Reproducible Audit and Reassessment of Portuguese Named Entity Recognition

Rafael O. Nunes, André Spritzer, Carla M. D. S. Freitas, Dennis G. Balreira


Abstract
For two decades, the HAREM corpus has served as the foundational benchmark for Portuguese Named Entity Recognition (NER), establishing its evaluation paradigm. Virtually all major progress has been measured against its fixed train/test split. This paper presents the first systematic audit of this split, revealing 153 overlapping (contaminated) sentences. We re-evaluate 13 NER models (ranging from CRFs to Transformers) on both the original and a new, decontaminated version of the corpus. Our statistical analysis reveals that decontamination has a significant (p < 0.05) and positive impact on the majority of models. We find that performance gains are most pronounced in the macro-F1 score (up to +4 points), demonstrating that the contamination primarily harmed generalization on rare entity types. Furthermore, our audit reveals clear evidence of overfitting in some models that benefited from data leakage. We conclude that even minor contamination can distort performance metrics and mask true model generalization. We release our decontaminated benchmark to ensure more reliable future evaluations.
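The kind of train/test overlap check described in the abstract can be illustrated with a minimal sketch. The abstract does not state the matching criterion or which side of the split was cleaned, so everything here is an assumption: sentences are compared by exact match after lowercasing and whitespace stripping, leaked sentences are dropped from the training side, and the names `normalize`, `find_overlap`, and `decontaminate` are hypothetical helpers, not the authors' code. The toy sentences are invented for illustration.

```python
from typing import List, Sequence, Set, Tuple

Sentence = Sequence[str]  # one tokenized sentence


def normalize(tokens: Sentence) -> Tuple[str, ...]:
    """Lowercase and strip tokens so trivial variants compare equal (assumed criterion)."""
    return tuple(t.strip().lower() for t in tokens if t.strip())


def find_overlap(train: Sequence[Sentence], test: Sequence[Sentence]) -> Set[Tuple[str, ...]]:
    """Return the normalized sentences that occur in both splits."""
    train_keys = {normalize(s) for s in train}
    test_keys = {normalize(s) for s in test}
    return train_keys & test_keys


def decontaminate(train: Sequence[Sentence], test: Sequence[Sentence]) -> List[Sentence]:
    """Drop from the training split every sentence that also appears in the test split.

    Removing from the training side is an assumption made for this sketch.
    """
    leaked = find_overlap(train, test)
    return [s for s in train if normalize(s) not in leaked]


# Toy data, invented for illustration only.
train = [["Lisboa", "é", "linda"], ["O", "Porto", "venceu"]]
test = [["lisboa", "é", "linda"], ["Maria", "chegou"]]

leaked = find_overlap(train, test)
clean_train = decontaminate(train, test)
print(len(leaked), len(clean_train))
```

Exact matching after normalization is the simplest criterion; a near-duplicate check (for example, token-level edit distance) would catch more cases but needs a threshold that this sketch does not try to choose.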
Anthology ID:
2026.propor-1.35
Volume:
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1
Month:
April
Year:
2026
Address:
Salvador, Brazil
Editors:
Marlo Souza, Iria de-Dios-Flores, Diana Santos, Larissa Freitas, Jackson Wilke da Cruz Souza, Eugénio Ribeiro
Venue:
PROPOR
Publisher:
Association for Computational Linguistics
Pages:
351–359
URL:
https://preview.aclanthology.org/ingest-dnd/2026.propor-1.35/
Cite (ACL):
Rafael O. Nunes, André Spritzer, Carla M. D. S. Freitas, and Dennis G. Balreira. 2026. Twenty Years of HAREM: A Reproducible Audit and Reassessment of Portuguese Named Entity Recognition. In Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1, pages 351–359, Salvador, Brazil. Association for Computational Linguistics.
Cite (Informal):
Twenty Years of HAREM: A Reproducible Audit and Reassessment of Portuguese Named Entity Recognition (Nunes et al., PROPOR 2026)
PDF:
https://preview.aclanthology.org/ingest-dnd/2026.propor-1.35.pdf