Diandra Fabre

2026

Learning to Spot Signs from Named Entities. A study on French Sign Language.
Julie Halbout | Annelies Braffort | Michèle Gouiffès | Diandra Fabre | Julie Lascar
Proceedings of the LREC 2026 12th Workshop on the Representation and Processing of Sign Languages: Language in Motion

French Sign Language (LSF) is a low-resource language with few available corpora, most of which are only partially annotated. Previous work on other sign languages has explored automatic sign annotation using subtitles as weak supervision, existing signaries, or mouthing cues. This paper focuses on the Matignon-LSF corpus, first leveraging lexical token spotting, then studying named entities (locations, companies, persons). Accounting for named entities enables the automatic detection of 30% to 100% more signs per class and improves the spotting of rare signs. In addition, this work provides insights into the signing of named entities and contributes resources for improving LSF-to-French translation models.

Leveraging Text-side Augmentation For Sign Language Translation
Diandra Fabre | Julie Lascar | Julie Halbout | Markarit Vartampetian
Proceedings of the LREC 2026 12th Workshop on the Representation and Processing of Sign Languages: Language in Motion

Sign language translation faces significant challenges due to the scarcity of annotated data and the inherent complexity of sign languages. This paper presents a method to improve sign-to-text translation models by augmenting data on the text side. We conduct experiments using two state-of-the-art models on two publicly available datasets: PHOENIX-2014T for German Sign Language and Mediapi-RGB for French Sign Language. Our main contributions are: (1) augmenting the training sets of both datasets on the text side using a generative model, (2) evaluating the impact of paraphrasing on BLEU and BLEURT scores, and (3) analyzing the impact of paraphrasing on translation outputs. We observed a significant improvement in translation for both languages. This suggests that adding variability to the training dataset through paraphrasing can lead to better generalization of the models. These results are comparable to state-of-the-art methods that use more complex approaches, such as Visual-Language fine-tuning, to improve translation.

We release Pantagruel models, a new family of self-supervised encoder models for French text and speech. Instead of predicting modality-tailored targets such as textual tokens or speech units, Pantagruel learns contextualized target representations in the feature space, allowing modality-specific encoders to capture linguistic and acoustic regularities more effectively. Separate models are pre-trained on large-scale French corpora, including Wikipedia, OSCAR and CroissantLLM for text, together with MultilingualLibriSpeech, LeBenchmark, and INA-100k for speech. INA-100k is a newly introduced 100,000-hour corpus of French audio derived from the archives of the Institut National de l’Audiovisuel (INA), the national repository of French radio and television broadcasts, providing highly diverse audio data. We evaluate Pantagruel across a broad range of downstream tasks spanning both modalities, including those from the standard French benchmarks such as FLUE or LeBenchmark. Across these tasks, Pantagruel models show competitive or superior performance compared to strong French baselines such as CamemBERT, FlauBERT, and LeBenchmark2.0, while maintaining a shared architecture that can seamlessly handle either speech or text inputs. These results confirm the effectiveness of feature-space self-supervised objectives for French representation learning and highlight Pantagruel as a robust foundation for multimodal speech-text understanding.

This position paper argues that the under-representation of social science tasks in contemporary LLM benchmarks limits advances in both LLM evaluation and social scientific inquiry. Benchmarks — standardized tools for assessing computational systems — are pivotal in the development of artificial intelligence (AI), including large language models (LLMs). Benchmarks do more than measure progress — they actively structure it, shaping reputations, research agendas, and commercial outcomes. Despite this central role, the social sciences are largely absent from mainstream evaluation frameworks, even though scholars in these fields generate dozens of rigorously annotated, context-sensitive datasets each year. Integrating this work into benchmark design could significantly improve the generalization and robustness of AI models. In turn, models trained on social scientific tasks would likely yield better performance on classic and contemporary tasks in disciplines as diverse as history, sociology, political science or economics. This is all the more pressing as these disciplines are quickly turning to LLMs for assistance. To address this gap, we introduce BenCSSmark, a benchmark composed of datasets annotated by computational social scientists. By integrating social scientific perspectives into benchmarking, BenCSSmark seeks to promote more robust, transparent, and socially relevant AI systems and to foster efficient collaboration.

Building a Dataset for French Accent Classification Evaluation: Are We There Yet?
Diandra Fabre | Mathieu Avanzi | François Portet
Proceedings of the Fifteenth Language Resources and Evaluation Conference

Current evaluation practices in speech processing systems often overlook the diversity of spoken accents, leading to significant performance disparities across speaker groups. This issue stems largely from biases and imbalances in training corpora, and is further compounded by the scarcity of open-source datasets suitable for evaluating accent variability in French. To address this gap, we extend the CFPR dataset with explicit accent labels, providing a new benchmark for assessing the robustness of speech technology systems across diverse French accents. We additionally conduct a perceptual study with 87 human participants to evaluate the reliability and interpretability of these labels. Using this resource, we evaluated an eight-class French accent classifier trained on Common Voice data. The first results highlight both the complexity of automatic French accent recognition in low-resource settings and the difficulty French speakers have in perceiving the full range of linguistic variation across French-speaking countries.

2025

Corpus bilingue sous-titrage et Langue des Signes Française : la problématique de l’alignement automatique des données
Julie Halbout | Diandra Fabre
Actes des 18e Rencontres Jeunes Chercheurs en RI (RJCRI) et 27ème Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL)

In this article, we present a study of the problem of automatic data alignment in a corpus made up of speeches in spoken French, subtitled in written French and interpreted into French Sign Language (LSF). After an introduction describing the very particular process of sign language interpreting, we survey the existing datasets for LSF and the specific features of the Matignon-LSF corpus, built from the weekly video reports of the Council of Ministers. We then show, through a few examples, some of the phenomena observed regarding temporal alignment between the subtitles, which are synchronized with the audio, and the interpreted LSF, which is subject to a time lag. We conclude that alignment cannot be performed at the level of written French sentences, and we propose some directions for future work.

SuperGPQA-HCE-FR : un corpus spécialisé en français pour le domaine hydraulique et le génie civil
Markarit Vartampetian | Diandra Fabre | Philippe Mulhem | Sylvain Joubert | Didier Schwab
Actes de l'atelier Évaluation des modèles génératifs (LLM) et challenge 2025 (EvalLLM)

In this article, we present SuperGPQA-HCE-FR, a French adaptation of a subset of the SuperGPQA benchmark focused on the domains of hydraulic engineering and civil engineering. It comprises 285 multiple-choice questions designed to evaluate and specialize multilingual large language models (LLMs) on technical tasks. The translation, produced automatically, is then evaluated by domain experts. Finally, we present first results on general-purpose multilingual Instruct models, comparing performance on the original English corpus with performance on the corpus translated into French.

2024

Matignon-LSF: a Large Corpus of Interpreted French Sign Language
Julie Halbout | Diandra Fabre | Yanis Ouakrim | Julie Lascar | Annelies Braffort | Michèle Gouiffès | Denis Beautemps
Proceedings of the LREC-COLING 2024 11th Workshop on the Representation and Processing of Sign Languages: Evaluation of Sign Language Resources
