MedPT: A Massive Medical Question Answering Dataset for Brazilian-Portuguese Speakers

Fernanda Bufon Farber, Iago Alves Brito, Julia Soares Dollis, Pedro Schindler Freire Brasil Ribeiro, Rafael Teixeira Sousa, Arlindo R. Galvão Filho


Abstract
While large language models (LLMs) show transformative potential in healthcare, their development remains focused on high-resource languages. This creates a critical barrier for other languages, as simple translation fails to capture unique clinical and cultural nuances, such as endemic diseases. To address this, we introduce MedPT, the first large-scale, real-world corpus of patient-doctor interactions for the Brazilian Portuguese medical domain. Comprising 384,095 authentic question-answer pairs and covering over 3,200 distinct health-related conditions, the dataset was refined through a rigorous multi-stage curation protocol that employed a hybrid quantitative-qualitative analysis to filter noise and contextually enrich thousands of ambiguous queries, resulting in a corpus of approximately 57 million tokens. We further use LLM-driven annotation to classify queries into seven semantic types to capture user intent. To validate MedPT’s utility, we benchmark it in a medical specialty classification task: fine-tuning a 1.7B parameter model achieves an outstanding 94% F1-score on a 20-class setup. Furthermore, our qualitative error analysis shows misclassifications are not random but reflect genuine clinical ambiguities (e.g., between comorbid conditions), proving the dataset’s deep semantic richness. We publicly release MedPT on Hugging Face to support the development of more equitable, accurate, and culturally-aware medical technologies for the Portuguese-speaking world.
Anthology ID:
2026.lrec-main.190
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
Publisher:
ELRA Language Resource Association
Pages:
2419–2429
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.190/
Cite (ACL):
Fernanda Bufon Farber, Iago Alves Brito, Julia Soares Dollis, Pedro Schindler Freire Brasil Ribeiro, Rafael Teixeira Sousa, and Arlindo R. Galvão Filho. 2026. MedPT: A Massive Medical Question Answering Dataset for Brazilian-Portuguese Speakers. International Conference on Language Resources and Evaluation, main:2419–2429.
Cite (Informal):
MedPT: A Massive Medical Question Answering Dataset for Brazilian-Portuguese Speakers (Farber et al., LREC 2026)
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.190.pdf