Arlindo R. Galvão Filho

2026

ToxSyn-PT: A Synthetic Fine-Grained Dataset of Minority-Targeted Toxic Language in Portuguese
Iago Alves Brito | Julia Soares Dollis | Fernanda Bufon Farber | Diogo Fernandes | Arlindo R. Galvão Filho
Proceedings of the Fifteenth Language Resources and Evaluation Conference

The development of robust hate speech detection systems remains limited by the lack of large-scale, fine-grained training data, especially for languages beyond English. Existing corpora typically rely on simplistic toxic and non-toxic labels, and the few that capture hate directed at specific minority groups lack the positive counterexamples required to distinguish genuine hate from mere discussion. In this work, we introduce ToxSyn-PT, the first Portuguese large-scale corpus explicitly designed for multi-label hate speech detection across nine protected minority groups, including the non-toxic counterexamples absent in all other public datasets. Generated via a controllable four-stage pipeline, ToxSyn contains discourse-type annotations to capture rhetorical strategies of toxic and non-toxic language, such as sarcasm, dehumanization, and cultural appreciation. Our experiments reveal a catastrophic, mutual generalization failure between ToxSyn and existing datasets from social-media domains: models trained on social media struggle to generalize to minority-specific contexts, and vice versa. This finding indicates that the two settings are distinct tasks and shows that summary metrics like Macro F1 can be unreliable indicators of true model behavior, as they can completely mask model failure. We publicly release ToxSyn on Hugging Face to support reproducible research on synthetic data generation and benchmark progress in hate speech detection for low- and mid-resource languages.


MedPT: A Massive Medical Question Answering Dataset for Brazilian-Portuguese Speakers
Fernanda Bufon Farber | Iago Alves Brito | Julia Soares Dollis | Pedro Schindler Freire Brasil Ribeiro | Rafael Teixeira Sousa | Arlindo R. Galvão Filho
Proceedings of the Fifteenth Language Resources and Evaluation Conference

While large language models (LLMs) show transformative potential in healthcare, their development remains focused on high-resource languages. This creates a critical barrier for other languages, as simple translation fails to capture unique clinical and cultural nuances, such as endemic diseases. To address this, we introduce MedPT, the first large-scale, real-world corpus of patient-doctor interactions for the Brazilian Portuguese medical domain. Comprising 384,095 authentic question-answer pairs and covering over 3,200 distinct health-related conditions, the dataset was refined through a rigorous multi-stage curation protocol that employed a hybrid quantitative-qualitative analysis to filter noise and contextually enrich thousands of ambiguous queries, resulting in a corpus of approximately 57 million tokens. We further use LLM-driven annotation to classify queries into seven semantic types to capture user intent. To validate MedPT’s utility, we benchmark it in a medical specialty classification task: fine-tuning a 1.7B parameter model achieves a 94% F1-score on a 20-class setup. Furthermore, our qualitative error analysis shows that misclassifications are not random but reflect genuine clinical ambiguities (e.g., between comorbid conditions), which points to the dataset’s deep semantic richness. We publicly release MedPT on Hugging Face to support the development of more equitable, accurate, and culturally aware medical technologies for the Portuguese-speaking world.

2025


AKCIT at SemEval-2025 Task 11: Investigating Data Quality in Portuguese Emotion Recognition
Iago A. Brito | Fernanda B. Färber | Julia S. Dollis | Daniel M. Pedrozo | Artur M. A. Novais | Diogo F. C. Silva | Arlindo R. Galvão Filho
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

This paper investigates the impact of data quality and processing strategies on emotion recognition in Brazilian Portuguese (PTBR) texts. We focus on data distribution, linguistic context, and augmentation techniques such as translation and synthetic data generation. To evaluate these aspects, we conduct experiments on the PTBR portion of the BRIGHTER dataset, a manually curated multilingual dataset containing nearly 100,000 samples, of which 4,552 are in PTBR. Our study encompasses both multi-label emotion detection (presence/absence classification) and emotion intensity prediction (0 to 3 scale), following the SemEval-2025 Task 11 setup. Results demonstrate that emotion intensity labels enhance model performance after discretization, and that smaller multilingual models can outperform larger ones in low-resource settings. Our official submission ranked 6th, but further refinements improved our ranking to 3rd, trailing the top submission by only 0.047, reinforcing the significance of a data-centric approach in emotion recognition.

Co-authors

Pedro Schindler Freire Brasil Ribeiro 1

Diogo F. C. Silva 1

Rafael Teixeira Sousa 1
