Arlindo Rodrigues Galvão Filho


2025

pdf bib
BRSpeech-DF: A Deep Fake Synthetic Speech Dataset for Portuguese Zero-Shot TTS
Alexandre Costa Ferro Filho | Rafaello Virgilli | Lucas Alcantara Souza | F S de Oliveira | Marcelo Henrique Lopes Ferreira | Daniel Tunnermann | Gustavo Dos Reis Oliveira | Anderson Da Silva Soares | Arlindo Rodrigues Galvão Filho
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

The detection of audio deepfakes (ADD) has become increasingly important due to the rapid evolution of generative speech models. However, progress in this field remains uneven across languages, particularly for low-resource languages like Portuguese, which lack high-quality datasets. In this paper, we introduce BRSpeech-DF, the first publicly available ADD dataset for Portuguese, encompassing both Brazilian and European variants. The dataset contains over 458,000 utterances, including a smaller portion of real speech from 62 speakers and a large collection of synthetic samples generated using multiple zero-shot text-to-speech (TTS) models, each conditioned on the original speaker’s voice. By providing this resource, our objective is to support the development of robust, multilingual detection systems, thereby advancing equity in speech forensics and security research. BRSpeech-DF addresses a significant gap in annotated data for underrepresented languages, facilitating more inclusive and generalizable advancements in synthetic speech detection.

pdf bib
Portuguese Automated Fact-checking: Information Retrieval with Claim extraction
Juliana Gomes | Eduardo Garcia | Arlindo Rodrigues Galvão Filho
Proceedings of the Eighth Fact Extraction and VERification Workshop (FEVER)

Current Portuguese Automated Fact-Checking (AFC) research often relies on datasets lacking integrated external evidence crucial for comprehensive verification. This study addresses this gap by systematically enriching Portuguese misinformation datasets. We retrieve web evidence by simulating user information-seeking behavior, guided by core claims extracted using Large Language Models (LLMs). Additionally, we apply a semi-automated validation framework to enhance dataset reliability.Our analysis reveals that inherent dataset characteristics impact data properties, evidence retrieval, and AFC model performance. While enrichment generally improves detection, its efficacy varies, influenced by challenges such as self-reinforcing online misinformation and API limitations. This work contributes enriched datasets, associating original texts with retrieved evidence and LLM-extracted claims, to foster future evidence-based fact-checking research.The code and enriched data for this study is available at https://github.com/ju-resplande/pt_afc.

pdf bib
Modeling, Evaluating, and Embodying Personality in LLMs: A Survey
Iago Alves Brito | Julia Soares Dollis | Fernanda Bufon Färber | Pedro Schindler Freire Brasil Ribeiro | Rafael Teixeira Sousa | Arlindo Rodrigues Galvão Filho
Findings of the Association for Computational Linguistics: EMNLP 2025

As large language models (LLMs) become integral to social and interactive applications, the ability to model, control, and evaluate their personality traits has become a critical area of research. This survey provides a comprehensive and structured overview of the LLM-driven personality scenario. We introduce a functional taxonomy that organizes the field by how personality is modeled (from rule-based methods to model-centric and system-level LLM techniques), across which modalities it is expressed (extending beyond text to vision, speech, and immersive virtual reality), and how it is validated (covering both qualitative and quantitative evaluation paradigms). By contextualizing current advances and systematically analyzing the limitations of existing methods including subjectivity, context dependence, limited multimodal integration, and the lack of standardized evaluation protocols, we identify key research gaps. This survey serves as a guide for future inquiry, paving the way for the development LLMs with more consistent consistent, expressive, and trustworthy personality traits.

pdf bib
Proxy Barrier: A Hidden Repeater Layer Defense Against System Prompt Leakage and Jailbreaking
Pedro Schindler Freire Brasil Ribeiro | Iago Alves Brito | Rafael Teixeira Sousa | Fernanda Bufon Färber | Julia Soares Dollis | Arlindo Rodrigues Galvão Filho
Findings of the Association for Computational Linguistics: EMNLP 2025

Prompt injection and jailbreak attacks remain a critical vulnerability for deployed large language models (LLMs), allowing adversaries to bypass safety protocols and extract sensitive information. To address this, we present Proxy Barrier (ProB), a lightweight defense that interposes a proxy LLM between the user and the target model. The proxy LLM is tasked solely to repeat the user input, and any failure indicates the presence of an attempt to reveal or override system instructions, leading the malicious request to be detected and blocked before it reaches the target model. ProB therefore requires no access to model weights or prompts, and is deployable entirely at the API level. Experiments across multiple model families demonstrate that ProB achieves state-of-the-art resilience against prompt leakage and jailbreak attacks. Notably, our approach outperforms baselines and achieves up to 98.8% defense effectiveness, and shows robust protection across both open and closed-source LLMs when suitably paired with proxy models, while also keeping response quality intact.