Anderson Raymundo Avila

2026

Retrieval-Augmented Generation with Small Language Models for Fake News Detection
Lucca Baptista Silva Ferraz | Jhúlia de Souza Leal | Anderson Raymundo Avila | Thiago Alexandre Salgueiro Pardo | Fernando Batista | Renato Moraes Silva
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1

The spread of online misinformation has made fake news detection an essential tool for mitigating its negative impact, but many studies often disregard the temporal information, and existing datasets become outdated as news evolve. Some modern solutions using Retrieval-Augmented Generation (RAG) can solve the problem of unseen news events by providing context to the models. However, there are no studies evaluating the feasibility of web searches to attain context to decide whether a news article is true or not. This work aims to address this gap by conducting a comparative study between RAG-based solutions, traditional fake news classification methods, and deep learning-based methods. The results show that although RAG is a modern and promising technique, it cannot outperform techniques already adopted in the literature.

pdf bib abs

Infox-QC: A Quebec-Focused French Corpus for Misinformation Detection and AI Robustness Assessment
Moetaz Doghmane | Hazem Amamou | Thiziri Sefsaf | Alan Davoust | Anderson Raymundo Avila
Proceedings of the Fifteenth Language Resources and Evaluation Conference

The pervasive spread of online misinformation, often through social media and political campaigns, makes detecting false claims a crucial task for mitigating societal risks. While the vast majority of fake news datasets are developed in English, a critical gap remains for low-resource languages, such as French. To address this, we introduce Infox-QC, a novel French-language corpus focused on misinformation relevant to the Quebec region. Beyond containing real true and fake news, Infox-QC includes two unique subsets of AI-generated fake news: one created by prompting an AI to paraphrase existing fake news, and a second generated by prompting an AI to fabricate fake news from real true reports. This innovative approach allows us to verify the robustness of detection systems against fabricated content, which modern LLMs can generate with convincing efficacy. We establish comprehensive baselines using traditional machine learning methods, BERT-based models, and Large Language Models, both with and without Retrieval-Augmented Generation (RAG). Our results demonstrate that RAG-augmented LLMs offer the strongest contextual understanding, while traditional models provide valuable interpretable baselines. We further provide an exploratory human–LLM thematic agreement analysis to assess annotation consistency. The Infox-QC resource fills a critical void in French-language NLP research, supporting future efforts to explore the regional and cultural dimensions of misinformation through cross-linguistic comparison.

pdf bib abs

PhonemeDF: A Synthetic Speech Dataset for Audio Deepfake Detection and Naturalness Evaluation
Vamshi Nallaguntla | Aishwarya R. Fursule | Shruti Kshirsagar | Anderson Raymundo Avila
Proceedings of the Fifteenth Language Resources and Evaluation Conference

The growing sophistication of speech generated by Artificial Intelligence (AI) has introduced new challenges in audio deepfake detection. Text-to-speech (TTS) and voice conversion (VC) technologies can create highly convincing synthetic speech with naturalness and intelligibility. This poses serious threats to voice biometric security and to systems designed to combat the spread of spoken misinformation, where synthetic voices may be used to disseminate false or malicious content. While interest in AI-generated speech has increased, resources for evaluating naturalness at the phoneme level remain limited. In this work, we address this gap by presenting the Phoneme-Level DeepFake dataset (PhonemeDF), comprising parallel real and synthetic speech segmented at the phoneme level. Real speech samples are derived from a subset of LibriSpeech, while synthetic samples are generated using four TTS and three VC systems. For each system, phoneme-aligned TextGrid files are obtained using the Montreal Forced Aligner (MFA). We compute the Kullback–Leibler divergence (KLD) between real and synthetic phoneme distributions to quantify fidelity and establish a ranking based on similarity to natural speech. Our findings show a clear correlation between the KLD of real and synthetic phoneme distributions and the performance of classifiers trained to distinguish them, suggesting that KLD can serve as an indicator of the most discriminative phonemes for deepfake detection.

Co-authors

Aishwarya R. Fursule 1

Shruti Kshirsagar 1

Jhúlia de Souza Leal 1

Vamshi Nallaguntla 1

Thiago Alexandre Salgueiro Pardo 1

Thiziri Sefsaf 1

Renato Moraes Silva 1

Venues

LREC2
PROPOR1

Fix author