Global vs. Local Sentence Embeddings for Brazilian Portuguese: Revisiting Monolingual Models in the Age of Foundation Models
Matheus Peixoto, Guilherme Silva, Giacomo Figueredo, Pedro Silva, Eduardo J. S. Luz
Abstract
The choice between large-scale multilingual foundation models and specialized monolingual models for languages like Brazilian Portuguese (PT-BR) presents a complex trade-off between generalization and specialization. This paper investigates this trade-off through an empirical study across a diverse suite of tasks. We evaluate multiple families of language models under both linear probing and fine-tuning regimes. We find that monolingual encoders exhibit greater "adaptation plasticity" during fine-tuning, improving on both classification and semantic similarity, whereas global (multilingual) models degrade. However, this plasticity comes at a cost: our tokenization analysis suggests that monolingual models struggle with foreign terms, whereas modern multilingual tokenizers show surprising morphological competence, challenging a long-standing assumption in the field. We conclude that the optimal model choice is a task-dependent trade-off between vocabulary coverage and adaptation flexibility.
- Anthology ID:
- 2026.propor-1.52
- Volume:
- Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1
- Month:
- April
- Year:
- 2026
- Address:
- Salvador, Brazil
- Editors:
- Marlo Souza, Iria de-Dios-Flores, Diana Santos, Larissa Freitas, Jackson Wilke da Cruz Souza, Eugénio Ribeiro
- Venue:
- PROPOR
- Publisher:
- Association for Computational Linguistics
- Pages:
- 529–539
- URL:
- https://preview.aclanthology.org/ingest-dnd/2026.propor-1.52/
- Cite (ACL):
- Matheus Peixoto, Guilherme Silva, Giacomo Figueredo, Pedro Silva, and Eduardo J. S. Luz. 2026. Global vs. Local Sentence Embeddings for Brazilian Portuguese: Revisiting Monolingual Models in the Age of Foundation Models. In Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1, pages 529–539, Salvador, Brazil. Association for Computational Linguistics.
- Cite (Informal):
- Global vs. Local Sentence Embeddings for Brazilian Portuguese: Revisiting Monolingual Models in the Age of Foundation Models (Peixoto et al., PROPOR 2026)
- PDF:
- https://preview.aclanthology.org/ingest-dnd/2026.propor-1.52.pdf