Automatic Question classification in Portuguese: A Large-Scale Dataset and Comparative Evaluation of Classification Strategies

Murilo Boccardo, Valéria D. Feltrim


Abstract
This paper presents a comparative evaluation of automatic classification strategies for Brazilian university entrance exam questions by subject and fine-grained topic. A central contribution of this study is the creation and curation of a large-scale Portuguese-language dataset comprising approximately 17,000 questions collected from the Agatha.edu platform, carefully cleaned and normalized. We investigated two alternative classification strategies: a single-step approach that directly predicts fine-grained topics and a two-stage approach in which an initial model predicts the subject, followed by specialized topic classifiers. These strategies were evaluated using both classical machine learning methods, such as Support Vector Machines, Naive Bayes, and Random Forest, and transformer-based language models pre-trained for Portuguese. Experimental results show the feasibility of large-scale automatic question classification and highlight the potential of NLP-based classification strategies to support the curation, analysis, and organization of educational question banks.
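The two strategies named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the four toy questions, the English label names, and the `TinyNaiveBayes` class are hypothetical, and a small from-scratch multinomial Naive Bayes stands in for the classical and transformer-based models the paper evaluates. The single-step model predicts the topic directly, while the two-stage pipeline first predicts the subject and then hands the question to that subject's own topic classifier.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens, add-one smoothing."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for word in text.lower().split():
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data: (question text, subject, fine-grained topic).
DATA = [
    ("calcule a derivada da função polinomial", "math", "calculus"),
    ("determine a área do triângulo retângulo", "math", "geometry"),
    ("calcule a velocidade média do carro", "physics", "kinematics"),
    ("determine a corrente elétrica no resistor", "physics", "electricity"),
]
texts = [t for t, _, _ in DATA]
subjects = [s for _, s, _ in DATA]
topics = [p for _, _, p in DATA]

# Single-step strategy: one classifier over all fine-grained topics.
flat = TinyNaiveBayes().fit(texts, topics)

# Two-stage strategy: a subject classifier plus one topic classifier per subject.
subject_clf = TinyNaiveBayes().fit(texts, subjects)
topic_clfs = {}
for subject in set(subjects):
    rows = [(t, p) for t, s, p in DATA if s == subject]
    topic_clfs[subject] = TinyNaiveBayes().fit(
        [t for t, _ in rows], [p for _, p in rows]
    )


def predict_two_stage(text):
    """Predict the subject first, then the topic within that subject."""
    subject = subject_clf.predict(text)
    return subject, topic_clfs[subject].predict(text)


question = "corrente elétrica no resistor"
print(flat.predict(question))
print(predict_two_stage(question))
```

In the two-stage design, a subject error in the first stage cannot be recovered by the second, since the question is then routed to the wrong topic classifier. In exchange, each topic classifier only has to separate the topics of a single subject.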
Anthology ID:
2026.propor-1.43
Volume:
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1
Month:
April
Year:
2026
Address:
Salvador, Brazil
Editors:
Marlo Souza, Iria de-Dios-Flores, Diana Santos, Larissa Freitas, Jackson Wilke da Cruz Souza, Eugénio Ribeiro
Venue:
PROPOR
Publisher:
Association for Computational Linguistics
Pages:
436–445
URL:
https://preview.aclanthology.org/ingest-dnd/2026.propor-1.43/
Cite (ACL):
Murilo Boccardo and Valéria D. Feltrim. 2026. Automatic Question classification in Portuguese: A Large-Scale Dataset and Comparative Evaluation of Classification Strategies. In Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1, pages 436–445, Salvador, Brazil. Association for Computational Linguistics.
Cite (Informal):
Automatic Question classification in Portuguese: A Large-Scale Dataset and Comparative Evaluation of Classification Strategies (Boccardo & Feltrim, PROPOR 2026)
PDF:
https://preview.aclanthology.org/ingest-dnd/2026.propor-1.43.pdf