Mathieu Constant

2026

The variation in writing style encapsulates nuanced characteristics, which are often exploited for author or demographic identification. In the medical domain, language models are frequently deployed to capture relevant information from unstructured or complex data, such as clinical notes that often include patients’ medical histories. Such data is largely free-form and unstructured, obtained through diverse clinician-patient interactions. In this work, we present a case study investigating whether variations in clinicians’ writing styles can lead to differences in the medical context understanding capabilities of pre-trained language models (PLMs) on downstream tasks, such as medical event classification. Our findings indicate that variation in writing style, characterized by linguistic features, can indeed lead to suboptimal performance in deployed systems. Furthermore, we explore linguistically guided counterfactual reasoning to mitigate the impact of writing style variation; the results suggest that LLM-based stylistic normalization is effective for this purpose.
Highly technical medical terms are difficult for patients to understand during fast-paced hospital consultations, leading them to rely on Large Language Models (LLMs) for simplified explanations. However, LLMs can produce inaccurate or false information. Since expert evaluation is costly and time-consuming, the LLM-as-a-Judge (LaaJ) approach is increasingly adopted to assess the quality of LLM-generated text. In this paper, we investigate the reliability and robustness of LaaJ for specialized medical knowledge by evaluating six LLMs for their judgment capabilities on three dimensions: correctness, readability, and completeness. We used three judgment setups (Vanilla, Epistemic, and Bias) to probe robustness, and assessed them against human expert annotations to measure alignment. To address the lack of specialized medical benchmarks, we introduce BrainCancerDB, an English dataset of 219 brain cancer terms with 23,652 annotations. Our findings indicate that while LLM-Judges and humans display similar trends in ranking simplified explanations, LLM-Judges tend to be more lenient on correctness, which may have serious implications in medical settings. Additionally, we observe that hallucinations in LaaJ setups can be mitigated by epistemic markers.
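As a rough illustration of the two observations above (similar ranking trends, but more lenient correctness scores), the following sketch compares a judge's scores with human scores for the same items. It is not the paper's evaluation code: the function names, the 1-5 scale, and the toy scores are all assumptions made for this example.

```python
from statistics import mean


def leniency(judge_scores, human_scores):
    """Mean signed difference (judge minus human).

    A positive value means the judge scores higher than the humans on average.
    """
    if len(judge_scores) != len(human_scores) or not judge_scores:
        raise ValueError("score lists must be non-empty and of equal length")
    return mean(j - h for j, h in zip(judge_scores, human_scores))


def pairwise_rank_agreement(judge_scores, human_scores):
    """Fraction of item pairs that both raters order the same way.

    Pairs tied for either rater are skipped.
    """
    agree = total = 0
    n = len(judge_scores)
    for i in range(n):
        for k in range(i + 1, n):
            dj = judge_scores[i] - judge_scores[k]
            dh = human_scores[i] - human_scores[k]
            if dj == 0 or dh == 0:
                continue  # a tie carries no ordering information
            total += 1
            agree += (dj > 0) == (dh > 0)
    return agree / total if total else float("nan")


# Invented correctness scores on a hypothetical 1-5 scale.
human = [2, 3, 4, 5]
judge = [4, 4, 5, 5]

print(leniency(judge, human))                 # judge minus human, averaged
print(pairwise_rank_agreement(judge, human))  # share of consistently ordered pairs
```

With these toy scores the judge orders every untied pair the same way as the humans while scoring one point higher on average, which is the pattern the abstract describes qualitatively.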
While humans can easily produce various types of answers, such as definitions, examples or paraphrases, Large Language Models (LLMs) struggle to provide correct answers to medical questions that require diverse answer formats. In this paper, we introduce TrackList, a fine-grained linguistic and statistical analysis pipeline to investigate the impact of the pre-training data on LLMs’ answers to diverse linguistic queries. We also propose RefoMed-EN, a medical dataset consisting of 6,170 human-annotated medical terms alongside their corresponding definitions, denominations, exemplifications, explanations, or paraphrases. We investigated whether the high or low frequency of a concept (head or tail knowledge) impacts the language model’s performance in answering medical questions. We evaluated the quality of the LLM’s output using syntactic and semantic similarity metrics, statistical correlations and embeddings. Results showed that the LLM’s answer quality is highest for definition-type questions and lowest for exemplification-type questions. Additionally, we showed that for definition-type medical questions ("What is multiple sclerosis?"), LLMs are prone to paraphrase more for popular medical concepts, and less for more specialized medical knowledge.
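The head-versus-tail comparison described above can be sketched in a few lines. This is only an illustration under stated assumptions: token-level Jaccard overlap stands in for the similarity metrics (the abstract does not name the ones used), and the frequency threshold, function names, and toy records are hypothetical rather than taken from TrackList or RefoMed-EN.

```python
def jaccard(reference, answer):
    """Token-set overlap between two strings, from 0.0 to 1.0."""
    ref_tokens = set(reference.lower().split())
    ans_tokens = set(answer.lower().split())
    if not ref_tokens and not ans_tokens:
        return 1.0  # two empty strings are treated as identical
    return len(ref_tokens & ans_tokens) / len(ref_tokens | ans_tokens)


def mean_overlap_by_bucket(records, threshold):
    """Average overlap for head (frequent) and tail (rare) concepts.

    records: iterable of (concept_frequency, reference_text, model_answer).
    threshold: assumed cut-off; frequencies at or above it count as "head".
    """
    buckets = {"head": [], "tail": []}
    for frequency, reference, answer in records:
        name = "head" if frequency >= threshold else "tail"
        buckets[name].append(jaccard(reference, answer))
    return {
        name: (sum(values) / len(values) if values else None)
        for name, values in buckets.items()
    }


# Placeholder records; real inputs would be reference texts and model outputs.
records = [
    (1000, "a b c", "a b c"),
    (3, "a b c", "a b d"),
]
print(mean_overlap_by_bucket(records, threshold=100))
```

A lower lexical overlap alongside a high semantic similarity would be one possible sign of paraphrasing, so in practice a sketch like this would be paired with an embedding-based score rather than used alone.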