Florian Cafiero

Also published as: Florian Raphaël Cafiero


2026

Graph-based Retrieval-Augmented Generation (RAG) is increasingly used to explore long, heterogeneous, and weakly structured corpora, including historical archives. However, in such settings, naive full-corpus indexing is often computationally costly and sensitive to OCR noise, document redundancy, and topical dispersion. In this paper, we investigate corpus pre-targeting strategies as an intermediate layer to improve the efficiency and effectiveness of graph-based RAG for historical research. We evaluate a set of pre-targeting heuristics tailored to single-hop and multi-hop historical questions on HistoriQA-ThirdRepublic, a French question-answering dataset derived from parliamentary debates and contemporary newspapers. Our results show that appropriate pre-targeting strategies can improve retrieval recall by 3–5% while reducing token consumption by 32–37% compared to full-corpus indexing, without degrading coverage of relevant documents. Beyond performance gains, this work highlights the importance of corpus-level optimization for applying RAG to large-scale historical collections, and provides practical insights for adapting graph-based RAG pipelines to the specific constraints of digitized archives.
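The pre-targeting idea described above can be pictured as a cheap lexical filter that narrows the corpus before the expensive graph-indexing step. The sketch below is a minimal stand-in: the overlap-count scoring and the cutoff `k` are illustrative assumptions, not the heuristics actually evaluated in the paper.

```python
def pre_target(corpus, question, k=3):
    """Select the top-k documents by token overlap with the question,
    so that only this subset is passed on to graph-based RAG indexing.
    Toy scoring function; the paper's pre-targeting heuristics differ."""
    q = set(question.lower().split())

    def score(doc):
        # Number of question tokens appearing in the document.
        return len(q & set(doc.lower().split()))

    # Stable sort: ties keep original corpus order.
    return sorted(corpus, key=score, reverse=True)[:k]
```

Any document-level relevance signal (BM25, dense similarity, metadata filters) could replace the overlap count; the point is that indexing cost then scales with `k` rather than with the full corpus size.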
Low-resource languages pose persistent challenges for Natural Language Processing tasks such as lemmatization and part-of-speech (POS) tagging. This paper investigates the capacity of recent large language models (LLMs), including GPT-4 variants and open-weight Mistral models, to address these tasks in few-shot and zero-shot settings for four historically and linguistically diverse under-resourced languages: Ancient Greek, Classical Armenian, Old Georgian, and Syriac. Using a novel benchmark comprising aligned training and out-of-domain test corpora, we evaluate the performance of foundation models on lemmatization and POS tagging, and compare them with PIE, a task-specific RNN baseline. Our results demonstrate that LLMs, even without fine-tuning, achieve competitive or superior performance in POS tagging and lemmatization across most languages in few-shot settings. Significant challenges persist for languages characterized by complex morphology and non-Latin scripts, but we demonstrate that LLMs are a credible and relevant option for initiating linguistic annotation tasks in the absence of data, serving as an effective aid for annotation.
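Few-shot annotation of this kind boils down to assembling labeled example pairs into a prompt and asking the model to complete the last one. The sketch below shows the general shape; the instruction wording, the tab-separated layout, and the use of Ancient Greek tokens are illustrative assumptions, not the paper's actual prompts.

```python
def few_shot_prompt(examples, target, language="Ancient Greek", task="POS-tag"):
    """Build a few-shot prompt for POS-tagging (or, with task='Lemma',
    lemmatization) of a low-resource language. `examples` is a list of
    (token, label) pairs; the model is expected to complete the final line."""
    lines = [f"{task} each token of this {language} text."]
    for token, label in examples:
        lines.append(f"Token: {token}\t{task}: {label}")
    # Leave the label of the target token blank for the model to fill in.
    lines.append(f"Token: {target}\t{task}:")
    return "\n".join(lines)
```

In a zero-shot setting, `examples` is simply empty and the model must rely on the instruction alone, which is where complex morphology and non-Latin scripts tend to hurt most.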

2024

This study explores the potential of stylometric analysis in identifying Self-Defining Memories (SDMs) authored by individuals with Attention-Deficit/Hyperactivity Disorder (ADHD) versus a control group. A sample of 198 SDMs written by 66 adolescents was analysed using Support Vector Classifiers (SVC). The analysis included a variety of linguistic features, such as character 3-grams, function words, sentence length, and lexical richness, as well as metadata about the participants (gender, age) and their SDMs (self-reported sentiment after recalling their memories). The results reveal a promising ability of linguistic analysis to accurately classify SDMs, with perfect prediction (F1 = 1.0) in the contextually simpler text-by-text setup, and satisfactory performance (F1 = 0.77) when predicting individual by individual. Such results highlight the significant role that linguistic characteristics play in reflecting the distinctive cognitive patterns associated with ADHD. While not a substitute for professional diagnosis, textual analysis offers a supportive avenue for early detection and a deeper understanding of ADHD.
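The character 3-gram features at the core of this kind of stylometric classification can be sketched with standard-library code. The nearest-centroid decision rule below is a deliberately simple stand-in for the paper's SVC, and the feature set is reduced to one family; both are assumptions for illustration only.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-gram counts: one of the stylometric feature
    families used in the study (alongside function words, etc.)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def nearest_centroid(train, text):
    """Assign `text` to the class whose pooled n-gram profile it is
    closest to. A toy classifier standing in for the paper's SVC."""
    profiles = {}
    for label, doc in train:
        profiles.setdefault(label, Counter()).update(char_ngrams(doc))
    return max(profiles, key=lambda lab: cosine(profiles[lab], char_ngrams(text)))
```

Swapping this rule for an SVC (and concatenating the other feature families and metadata into the vector) recovers the study's general pipeline shape, though not its exact configuration.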