Pacôme Constant dit Beaufils

Also published as: Pacome Constant Dit Beaufils

2026

MedInjection-FR: Exploring the Role of Native, Synthetic, and Translated Data in Biomedical Instruction Tuning
Ikram Belmadani | Oumaima El Khettari | Pacome Constant Dit Beaufils | Benoit Favre | Richard Dufour
Proceedings of the Fifteenth Language Resources and Evaluation Conference

Instruction tuning has become essential for adapting large language models (LLMs) to follow domain-specific prompts. Yet, in specialized fields such as medicine, the scarcity of high-quality French instruction data limits effective supervision. To address this gap, we introduce MedInjection-FR, a large-scale French biomedical instruction dataset comprising 571K instruction–response pairs drawn from three complementary sources: native, synthetic, and translated data. We design a controlled experimental framework to systematically assess how data provenance affects instruction tuning, using Qwen-4B-Instruct fine-tuned across seven configurations combining these sources. Results show that native data yield the strongest performance, while mixed setups, particularly native and translated, provide complementary benefits. Synthetic data alone remain less effective but contribute positively when balanced with native supervision. Evaluation on open-ended QA combines automatic metrics, LLM-as-a-judge assessment, and human expert review; although LLM-based judgments correlate best with human ratings, they show sensitivity to verbosity. These findings highlight that data authenticity and diversity jointly shape downstream adaptation and that heterogeneous supervision can mitigate the scarcity of native French medical instructions.


Who Judges the Judge? Evaluating LLM-as-a-Judge for French Medical open-ended QA
Ikram Belmadani | Oumaima El Khettari | Pacôme Constant dit Beaufils | Richard Dufour | Benoit Favre
Proceedings of the 1st Workshop on Linguistic Analysis for Health (HeaLing 2026)

Automatic evaluation of open-ended question answering in specialized domains remains challenging mainly because it relies on manual annotations from domain experts. In this work, we assess the ability of several large language models (LLMs), including closed-access (GPT-5.1, Gemini-2.5-Pro), open-source general-purpose (Qwen-80B), and biomedical domain-adapted models (MedGemma-27B, Phi-3.5-mini variants), to act as automatic evaluators of semantic equivalence in French medical open-ended QA. Our analysis reveals that LLM-based judgments are sensitive to the source of answer generation: judgment correlation varies substantially across different generator models. Among the judges, MedGemma-27B and Qwen-80B achieve the highest agreement with expert annotations in terms of F1 score and Pearson correlation. We further explore lightweight adaptation strategies on Phi-3.5-mini using supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO). Even with 184 training instances, these adaptations significantly improve Phi-3.5’s results and reduce variability across answer generators, achieving performance comparable to larger domain-adapted models. Our results highlight the importance of generator-aware evaluation, the limitations of general-purpose LLMs in domain-specific settings, and the effectiveness of lightweight adaptation for compact models in low-resource scenarios.

2024


The biomedical domain has sparked significant interest in the field of Natural Language Processing (NLP), which has seen substantial advancements with pre-trained language models (PLMs). However, comparing these models has proven challenging due to variations in evaluation protocols across different models. A fair solution is to aggregate diverse downstream tasks into a benchmark, allowing for the assessment of intrinsic PLM qualities from various perspectives. Although still limited to a few languages, this initiative has been undertaken in the biomedical field, notably for English and Chinese. This limitation hampers the evaluation of the latest French biomedical models, as they are either assessed on a minimal number of tasks with non-standardized protocols or evaluated using general downstream tasks. To bridge this research gap and account for the unique sensitivities of French, we present the first-ever publicly available French biomedical language understanding benchmark called DrBenchmark. It encompasses 20 diversified tasks, including named-entity recognition, part-of-speech tagging, question-answering, semantic textual similarity, and classification. We evaluate 8 state-of-the-art pre-trained masked language models (MLMs) on general and biomedical-specific data, as well as English-specific MLMs to assess their cross-lingual capabilities. Our experiments reveal that no single model excels across all tasks, while generalist models are sometimes still competitive.

Co-authors

Béatrice Daille 1

Pierre-Antoine Gourraud 1
