Alexandre Daniel Audibert
2026
Pantagruel: Unified Self-Supervised Encoders for French Text and Speech
Phuong-Hang Le | Valentin Pelloin | Arnault Chatelain | Maryem Bouziane | Mohammed Ghennai | Qianwen Guan | Kirill Milintsevich | Salima Mdhaffar | Aidan Mannion | Nils Defauw | Shuyue Gu | Alexandre Daniel Audibert | Marco Dinarelli | Yannick Estève | Lorraine Goeuriot | Steffen Lalande | Nicolas Hervé | Maximin Coavoux | François Portet | Étienne Ollion | Marie Candito | Maxime Peyrard | Solange Rossato | Benjamin Lecouteux | Aurélie Nardy | Gilles Sérasset | Vincent Segonne | Solène Evain | Diandra Fabre | Didier Schwab
Proceedings of the Fifteenth Language Resources and Evaluation Conference
We release the Pantagruel models, a new family of self-supervised encoder models for French text and speech. Instead of predicting modality-tailored targets such as textual tokens or speech units, Pantagruel learns contextualized target representations in feature space, allowing modality-specific encoders to capture linguistic and acoustic regularities more effectively. Separate models are pre-trained on large-scale French corpora: Wikipedia, OSCAR, and CroissantLLM for text, together with MultilingualLibriSpeech, LeBenchmark, and INA-100k for speech. INA-100k is a newly introduced 100,000-hour corpus of French audio derived from the archives of the Institut National de l’Audiovisuel (INA), the national repository of French radio and television broadcasts, providing highly diverse audio data. We evaluate Pantagruel on a broad range of downstream tasks spanning both modalities, including tasks from standard French benchmarks such as FLUE and LeBenchmark. Across these tasks, Pantagruel models show performance competitive with or superior to strong French baselines such as CamemBERT, FlauBERT, and LeBenchmark2.0, while maintaining a shared architecture that can seamlessly handle either speech or text inputs. These results confirm the effectiveness of feature-space self-supervised objectives for French representation learning and highlight Pantagruel as a robust foundation for multimodal speech-text understanding.
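The feature-space objective the abstract describes is in the spirit of data2vec-style self-distillation: a teacher network (typically an exponential moving average of the student) produces contextualized targets from the full input, and the student regresses those targets from a masked view. The sketch below is a minimal, hypothetical illustration of that general recipe, not the Pantagruel training code; the function names, masking scheme, and hyperparameters are all assumptions.

```python
import torch
import torch.nn.functional as F

def ema_update(teacher: torch.nn.Module, student: torch.nn.Module,
               tau: float = 0.999) -> None:
    """Keep the teacher as an exponential moving average (EMA) of the student."""
    with torch.no_grad():
        for p_t, p_s in zip(teacher.parameters(), student.parameters()):
            p_t.mul_(tau).add_(p_s, alpha=1.0 - tau)

def feature_space_ssl_step(student, teacher, x, mask):
    """One self-supervised step: the student sees a masked view of the input
    and regresses the teacher's contextualized features at masked positions,
    instead of predicting discrete tokens or speech units.

    x:    (batch, seq, dim) input features (token embeddings or speech frames)
    mask: (batch, seq) boolean tensor, True where the input is masked
    """
    with torch.no_grad():
        targets = teacher(x)                       # targets from the full view
    masked_x = x.masked_fill(mask.unsqueeze(-1), 0.0)
    preds = student(masked_x)                      # predictions from masked view
    # Loss is computed only over masked positions, in feature space.
    return F.mse_loss(preds[mask], targets[mask])
```

After each optimizer step on the student, `ema_update(teacher, student)` refreshes the target network. Real data2vec-style recipes additionally average and normalize several top teacher layers before regression; this sketch omits those details.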
2024
Jargon: A Suite of Language Models and Evaluation Tasks for French Specialized Domains
Vincent Segonne | Aidan Mannion | Laura Cristina Alonzo Canul | Alexandre Daniel Audibert | Xingyu Liu | Cécile Macaire | Adrien Pupier | Yongxin Zhou | Mathilde Aguiar | Felix E. Herron | Magali Norré | Massih R Amini | Pierrette Bouillon | Iris Eshkol-Taravella | Emmanuelle Esperança-Rodier | Thomas François | Lorraine Goeuriot | Jérôme Goulian | Mathieu Lafourcade | Benjamin Lecouteux | François Portet | Fabien Ringeval | Vincent Vandeghinste | Maximin Coavoux | Marco Dinarelli | Didier Schwab
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Pretrained Language Models (PLMs) are the de facto backbone of most state-of-the-art NLP systems. In this paper, we introduce a family of domain-specific PLMs for French, focusing on three important domains: transcribed speech, medicine, and law. We use a transformer architecture based on an efficient attention method (Linformer) to maximise their utility, since these domains often involve processing long documents. We evaluate and compare our models to state-of-the-art models on a diverse set of tasks and datasets, some of which are introduced in this paper. We gather the datasets into a new French-language evaluation benchmark for these three domains. We also compare various training configurations: continued pretraining, pretraining from scratch, and single- versus multi-domain pretraining. Extensive domain-specific experiments show that competitive downstream performance can be attained even when pretraining with the approximate Linformer attention mechanism. For full reproducibility, we release the models and pretraining data, as well as the contributed datasets.
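Since Linformer attention is the abstract's headline architectural choice, a brief sketch may help: Linformer compresses the length-n sequence axis of keys and values down to a fixed rank k with learned projection matrices, so self-attention costs O(nk) rather than O(n²), which is what makes long-document pretraining tractable. The class below is a minimal, hypothetical PyTorch illustration of that idea, not the Jargon implementation; all names and hyperparameters (k=256, 8 heads, fixed seq_len with padded inputs) are assumptions.

```python
import torch
import torch.nn as nn

class LinformerSelfAttention(nn.Module):
    """Self-attention with Linformer's low-rank sequence projection.

    Learned matrices E and F compress the sequence axis of keys and values
    from length n down to rank k, reducing cost from O(n^2) to O(n*k).
    Inputs are assumed padded to a fixed seq_len.
    """
    def __init__(self, dim: int, seq_len: int, k: int = 256, heads: int = 8):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.head_dim = heads, dim // heads
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, 2 * dim)
        self.E = nn.Parameter(torch.randn(seq_len, k) / k ** 0.5)  # key proj.
        self.F = nn.Parameter(torch.randn(seq_len, k) / k ** 0.5)  # value proj.
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        q = self.to_q(x).view(b, n, self.heads, self.head_dim).transpose(1, 2)
        kx, vx = self.to_kv(x).chunk(2, dim=-1)
        # Project the sequence dimension: (b, n, d) -> (b, k, d).
        kx = torch.einsum('bnd,nk->bkd', kx, self.E)
        vx = torch.einsum('bnd,nk->bkd', vx, self.F)
        kx = kx.view(b, -1, self.heads, self.head_dim).transpose(1, 2)
        vx = vx.view(b, -1, self.heads, self.head_dim).transpose(1, 2)
        attn = (q @ kx.transpose(-2, -1)) * self.head_dim ** -0.5  # (b, h, n, k)
        out = attn.softmax(dim=-1) @ vx                            # (b, h, n, hd)
        return self.to_out(out.transpose(1, 2).reshape(b, n, d))
```

The attention map here is n-by-k rather than n-by-k-free n-by-n, which is the whole trade-off: memory and compute scale linearly in sequence length at the cost of an approximate, fixed-rank view of the keys and values.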
Co-authors (number of joint papers)
- Maximin Coavoux 2
- Marco Dinarelli 2
- Lorraine Goeuriot 2
- Benjamin Lecouteux 2
- Aidan Mannion 2
- François Portet 2
- Didier Schwab 2
- Vincent Segonne 2
- Mathilde Aguiar 1
- Laura Cristina Alonzo Canul 1
- Massih R Amini 1
- Pierrette Bouillon 1
- Maryem Bouziane 1
- Marie Candito 1
- Arnault Chatelain 1
- Nils Defauw 1
- Iris Eshkol-Taravella 1
- Emmanuelle Esperança-Rodier 1
- Yannick Estève 1
- Solène Evain 1
- Diandra Fabre 1
- Thomas François 1
- Mohammed Ghennai 1
- Jérôme Goulian 1
- Shuyue Gu 1
- Qianwen Guan 1
- Felix E. Herron 1
- Nicolas Hervé 1
- Mathieu Lafourcade 1
- Steffen Lalande 1
- Phuong-Hang Le 1
- Xingyu Liu 1
- Cécile Macaire 1
- Salima Mdhaffar 1
- Kirill Milintsevich 1
- Aurélie Nardy 1
- Magali Norré 1
- Étienne Ollion 1
- Valentin Pelloin 1
- Maxime Peyrard 1
- Adrien Pupier 1
- Fabien Ringeval 1
- Solange Rossato 1
- Gilles Sérasset 1
- Vincent Vandeghinste 1
- Yongxin Zhou 1