Qianwen Guan
2026
BenCSSmark: Making the Social Sciences Count in LLM Research
Arnault Chatelain | Etienne Ollion | Qianwen Guan | Diandra Fabre | Lorraine Goeuriot | Emile Chapuis | Abdelkrim Beloued | Marie Candito | Nicolas Hervé | Didier Schwab
Proceedings of the Fifteenth Language Resources and Evaluation Conference
This position paper argues that the under-representation of social science tasks in contemporary LLM benchmarks limits advances in both LLM evaluation and social scientific inquiry. Benchmarks — standardized tools for assessing computational systems — are pivotal in the development of artificial intelligence (AI), including large language models (LLMs). Benchmarks do more than measure progress — they actively structure it, shaping reputations, research agendas, and commercial outcomes. Despite this central role, the social sciences are largely absent from mainstream evaluation frameworks, even though scholars in these fields generate dozens of rigorously annotated, context-sensitive datasets each year. Integrating this work into benchmark design could significantly improve the generalization and robustness of AI models. In turn, models trained on social scientific tasks would likely yield better performance on classic and contemporary tasks in disciplines as diverse as history, sociology, political science or economics. This is all the more pressing as these disciplines are quickly turning to LLMs for assistance. To address this gap, we introduce BenCSSmark, a benchmark composed of datasets annotated by computational social scientists. By integrating social scientific perspectives into benchmarking, BenCSSmark seeks to promote more robust, transparent, and socially relevant AI systems and to foster efficient collaboration.
Pantagruel: Unified Self-Supervised Encoders for French Text and Speech
Phuong-Hang Le | Valentin Pelloin | Arnault Chatelain | Maryem Bouziane | Mohammed Ghennai | Qianwen Guan | Kirill Milintsevich | Salima Mdhaffar | Aidan Mannion | Nils Defauw | Shuyue Gu | Alexandre Daniel Audibert | Marco Dinarelli | Yannick Estève | Lorraine Goeuriot | Steffen Lalande | Nicolas Hervé | Maximin Coavoux | François Portet | Étienne Ollion | Marie Candito | Maxime Peyrard | Solange Rossato | Benjamin Lecouteux | Aurélie Nardy | Gilles Sérasset | Vincent Segonne | Solène Evain | Diandra Fabre | Didier Schwab
Proceedings of the Fifteenth Language Resources and Evaluation Conference
We release Pantagruel models, a new family of self-supervised encoder models for French text and speech. Instead of predicting modality-tailored targets such as textual tokens or speech units, Pantagruel learns contextualized target representations in the feature space, allowing modality-specific encoders to capture linguistic and acoustic regularities more effectively. Separate models are pre-trained on large-scale French corpora, including Wikipedia, OSCAR and CroissantLLM for text, together with MultilingualLibriSpeech, LeBenchmark, and INA-100k for speech. INA-100k is a newly introduced 100,000-hour corpus of French audio derived from the archives of the Institut National de l’Audiovisuel (INA), the national repository of French radio and television broadcasts, providing highly diverse audio data. We evaluate Pantagruel across a broad range of downstream tasks spanning both modalities, including those from the standard French benchmarks such as FLUE or LeBenchmark. Across these tasks, Pantagruel models show competitive or superior performance compared to strong French baselines such as CamemBERT, FlauBERT, and LeBenchmark2.0, while maintaining a shared architecture that can seamlessly handle either speech or text inputs. These results confirm the effectiveness of feature-space self-supervised objectives for French representation learning and highlight Pantagruel as a robust foundation for multimodal speech-text understanding.
2016
La perception des séquences consonantiques non-natives par les locuteurs monolingues de mandarin (Perception of non-native consonant sequences by Mandarin monolingual speakers)
Qianwen Guan | Harim Kwon
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 1 : JEP
This study examines the role of native phonotactic structure and phonetic factors in the perception of non-native consonant sequences. Monolingual Mandarin speakers were tested in two experiments. In the first experiment, the speakers had to decide whether they heard a vowel between two consonants while listening to intervocalic CC sequences (akta) and their CVC controls (akata). The monolingual Mandarin participants tended to perceive a vowel between the two consonants in both CC and CVC sequences, but the percentage of perceived vowels varied across sequences. In the second experiment, the same participants listened to initial and intervocalic CC sequences (ktapa, akta) as well as CVC sequences (katapa, akata) and transcribed them in Pinyin. The strategies observed in the transcriptions (epenthesis, metathesis, omission of C1, and omission of C2) show that the participants are sensitive to phonetic factors. The results of the two experiments suggest that native phonotactics as well as phonetic factors affect the perception of non-native sequences.
Co-authors
- Marie Candito 2
- Arnault Chatelain 2
- Diandra Fabre 2
- Lorraine Goeuriot 2
- Nicolas Hervé 2
- Etienne Ollion 2
- Didier Schwab 2
- Alexandre Daniel Audibert 1
- Abdelkrim Beloued 1
- Maryem Bouziane 1
- Emile Chapuis 1
- Maximin Coavoux 1
- Nils Defauw 1
- Marco Dinarelli 1
- Yannick Estève 1
- Solène Evain 1
- Mohammed Ghennai 1
- Shuyue Gu 1
- Harim Kwon 1
- Steffen Lalande 1
- Phuong-Hang Le 1
- Benjamin Lecouteux 1
- Aidan Mannion 1
- Salima Mdhaffar 1
- Kirill Milintsevich 1
- Aurélie Nardy 1
- Valentin Pelloin 1
- Maxime Peyrard 1
- François Portet 1
- Solange Rossato 1
- Vincent Segonne 1
- Gilles Sérasset 1