Field of Science and Technology Classification of Academic Documents in Portuguese

Ivo Simões, Hugo Gonçalo Oliveira, João Correia


Abstract
Towards improving metadata in academic repositories, this study evaluates the efficacy of different transformer-based models in the automatic classification of the Field of Science and Technology (FOS) of academic theses written in Portuguese. We compare the performance of four different encoder models, two multilingual and two Portuguese-specific, against five larger decoder-based LLMs, on a dataset of 9,696 theses characterized by their title, keywords, and abstract. Fine-tuned encoder-based models achieved the best scores (F1 = 88%), outperforming general-purpose decoder models prompted for the task. These results suggest that, for localized academic domains, task-specific fine-tuning remains more effective than general-purpose LLM prompting.
Anthology ID:
2026.propor-1.104
Volume:
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1
Month:
April
Year:
2026
Address:
Salvador, Brazil
Editors:
Marlo Souza, Iria de-Dios-Flores, Diana Santos, Larissa Freitas, Jackson Wilke da Cruz Souza, Eugénio Ribeiro
Venue:
PROPOR
Publisher:
Association for Computational Linguistics
Pages:
1021–1026
URL:
https://preview.aclanthology.org/ingest-dnd/2026.propor-1.104/
Cite (ACL):
Ivo Simões, Hugo Gonçalo Oliveira, and João Correia. 2026. Field of Science and Technology Classification of Academic Documents in Portuguese. In Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1, pages 1021–1026, Salvador, Brazil. Association for Computational Linguistics.
Cite (Informal):
Field of Science and Technology Classification of Academic Documents in Portuguese (Simões et al., PROPOR 2026)
PDF:
https://preview.aclanthology.org/ingest-dnd/2026.propor-1.104.pdf