PortOldBERT: Portuguese Historical Language Models

Tomas Freitas Osorio, Henrique Lopes Cardoso


Abstract
Historical language models play a crucial role in the study of languages and can benefit tasks such as named-entity recognition (NER), part-of-speech (PoS) tagging, and post-OCR correction. Despite their relevance, most efforts have concentrated on English. To the best of our knowledge, no such model exists for historical Portuguese. In this work, we introduce PortOldBERT, the first historical Portuguese encoder language model. We demonstrate its usefulness by comparing PortOldBERT’s performance with that of Albertina, the encoder on which it is based, across multiple tasks, namely pseudo-perplexity, NER, PoS tagging, word error rate (WER) prediction, and OCR error detection, and for different historical periods. PortOldBERT consistently outperforms Albertina on historical data, demonstrating its ability to effectively integrate historical linguistic contexts while retaining the ability to process contemporary text.
Anthology ID: 2026.eacl-long.123
Volume: Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: March
Year: 2026
Address: Rabat, Morocco
Editors: Vera Demberg, Kentaro Inui, Lluís Marquez
Venue: EACL
Publisher: Association for Computational Linguistics
Pages: 2691–2705
URL: https://preview.aclanthology.org/ingest-eacl/2026.eacl-long.123/
Cite (ACL): Tomas Freitas Osorio and Henrique Lopes Cardoso. 2026. PortOldBERT: Portuguese Historical Language Models. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2691–2705, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal): PortOldBERT: Portuguese Historical Language Models (Osorio & Lopes Cardoso, EACL 2026)
PDF: https://preview.aclanthology.org/ingest-eacl/2026.eacl-long.123.pdf