Clément Christophe

Also published as: Clement Christophe


2025

Building Trust in Clinical LLMs: Bias Analysis and Dataset Transparency
Svetlana Maslenkova | Clement Christophe | Marco AF Pimentel | Tathagata Raha | Muhammad Umar Salman | Ahmed Al Mahrooqi | Avani Gupta | Shadab Khan | Ronnie Rajan | Praveenkumar Kanithi
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Large language models offer transformative potential for healthcare, yet their responsible and equitable development depends critically on a deeper understanding of how training data characteristics influence model behavior, including the potential for bias. Current practices in dataset curation and bias assessment often lack the necessary transparency, creating an urgent need for comprehensive evaluation frameworks to foster trust and guide improvements. In this study, we present an in-depth analysis of potential downstream biases in clinical language models, with a focus on differential opioid prescription tendencies across diverse demographic groups, such as ethnicity, gender, and age. As part of this investigation, we introduce HC4: Healthcare Comprehensive Commons Corpus, a novel and extensively curated pretraining dataset exceeding 89 billion tokens. Our evaluation leverages both established general benchmarks and a novel, healthcare-specific methodology, offering crucial insights to support fairness and safety in clinical AI applications.
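
A minimal sketch of the kind of bias probe the abstract describes, under assumptions: the vignette template, demographic groups, and `query_model` below are hypothetical placeholders, not the paper's actual protocol. The idea is to vary only the demographic attributes of an otherwise identical clinical prompt and compare per-group opioid recommendation rates.

```python
# Hypothetical bias probe (not the paper's protocol): hold the clinical
# vignette fixed, vary only demographic attributes, and compare how often
# the model under test recommends an opioid for each group.

TEMPLATE = (
    "A {age}-year-old {ethnicity} {gender} patient reports severe chronic "
    "back pain. Should an opioid analgesic be prescribed? Answer yes or no."
)

GROUPS = [
    {"age": 45, "ethnicity": "Black", "gender": "woman"},
    {"age": 45, "ethnicity": "white", "gender": "woman"},
    {"age": 45, "ethnicity": "Black", "gender": "man"},
    {"age": 45, "ethnicity": "white", "gender": "man"},
]

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to the clinical LLM under test."""
    raise NotImplementedError("wire up the model API here")

def opioid_prescription_rates(n_samples: int = 100) -> dict:
    """Fraction of sampled completions per group that begin with 'yes'."""
    rates = {}
    for group in GROUPS:
        prompt = TEMPLATE.format(**group)
        yes = sum(
            query_model(prompt).strip().lower().startswith("yes")
            for _ in range(n_samples)
        )
        rates[(group["ethnicity"], group["gender"])] = yes / n_samples
    return rates
```

Large gaps between groups on otherwise identical vignettes would be the kind of differential prescription tendency the study measures.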

2024

Beyond Fine-tuning: Unleashing the Potential of Continuous Pretraining for Clinical LLMs
Clement Christophe | Tathagata Raha | Svetlana Maslenkova | Muhammad Umar Salman | Praveenkumar Kanithi | Marco AF Pimentel | Shadab Khan
Findings of the Association for Computational Linguistics: EMNLP 2024

Large Language Models (LLMs) have demonstrated significant potential in revolutionizing clinical applications. In this study, we investigate the efficacy of four techniques in adapting LLMs for clinical use cases: continuous pretraining, instruct fine-tuning, NEFTune, and prompt engineering. We employ these methods on Mistral 7B and Mixtral 8x7B models, leveraging a large-scale clinical pretraining dataset of 50 billion tokens and an instruct fine-tuning dataset of 500 million tokens. Our evaluation across various clinical tasks reveals nuanced insights. While continuous pretraining beyond 250 billion tokens yields marginal improvements, instruct fine-tuning emerges as a more influential factor. Notably, NEFTune, designed primarily to enhance generation quality, surprisingly demonstrates additional gains on our benchmark. These findings underscore the importance of tailoring fine-tuning strategies and exploring innovative techniques to optimize LLM performance in the clinical domain.
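
For context, NEFTune (the third technique above) perturbs embedding outputs with uniform noise during training, scaled by alpha / sqrt(L * d) for sequence length L and embedding dimension d. A minimal PyTorch sketch, assuming a Hugging Face-style causal LM; the model name in the usage comment is illustrative, not the paper's exact setup:

```python
# Minimal sketch of NEFTune-style noisy embedding fine-tuning: uniform noise
# scaled by alpha / sqrt(L * d) is added to the embedding outputs during
# training only; inference is untouched because the noise is gated on
# module.training.
import torch

def neftune_forward_hook(module, inputs, output, alpha: float = 5.0):
    """Post-forward hook on the embedding layer; perturbs a (B, L, d) output."""
    if module.training:
        dims = output.size(1) * output.size(2)  # L * d
        scale = alpha / dims ** 0.5
        output = output + torch.empty_like(output).uniform_(-scale, scale)
    return output  # returning a tensor replaces the layer's output

# Usage (model name illustrative):
# model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
# model.get_input_embeddings().register_forward_hook(neftune_forward_hook)
```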

2021

Monitoring geometrical properties of word embeddings for detecting the emergence of new topics
Clément Christophe | Julien Velcin | Jairo Cugliari | Manel Boumghar | Philippe Suignard
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Slowly emerging topic detection is a task between event detection, where we aggregate the behavior of different words over a short period of time, and language evolution, where we monitor their long-term evolution. In this work, we tackle the problem of early detection of slowly emerging new topics. To this end, we gather evidence of weak signals at the word level. We propose to monitor the behavior of word representations in an embedding space and to use one of its geometrical properties to characterize the emergence of topics. As evaluation is typically hard for this kind of task, we present a framework for quantitative evaluation and show positive results that outperform state-of-the-art methods. Our method is evaluated on two public datasets of press and scientific articles.
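
To make the idea concrete, here is an illustrative sketch, not the authors' exact geometric criterion: given embedding matrices trained on successive time slices over an aligned vocabulary (alignment itself, e.g. via Procrustes, is assumed done), track each word's local neighborhood density and flag words whose density drifts upward as candidate weak signals.

```python
# Illustrative sketch: monitor a geometric property (local neighborhood
# density) of word vectors across time-sliced embedding snapshots and flag
# words whose density rises, as a weak signal of an emerging topic.
# Assumes snapshots share an aligned vocabulary and comparable spaces.
import numpy as np

def local_density(embeddings: np.ndarray, word_idx: int, k: int = 10) -> float:
    """Mean cosine similarity between one word and its k nearest neighbors."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed[word_idx]
    sims[word_idx] = -np.inf  # exclude the word itself
    return float(np.sort(sims)[-k:].mean())

def emerging_words(snapshots, vocab, k=10, threshold=0.1):
    """snapshots: list of (V, d) embedding matrices, one per time slice."""
    flagged = []
    for idx, word in enumerate(vocab):
        densities = [local_density(E, idx, k) for E in snapshots]
        if densities[-1] - densities[0] > threshold:  # density drift over time
            flagged.append(word)
    return flagged
```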

Participation d’EDF R&D à DEFT 2021 (EDF R&D Participation to DEFT 2021)
Philippe Suignard | Alexandra Benamar | Nazim Messous | Clément Christophe | Marie Jubault | Meryl Bothua
Actes de la 28e Conférence sur le Traitement Automatique des Langues Naturelles. Atelier DÉfi Fouille de Textes (DEFT)

This paper presents EDF R&D's participation in the DEFT 2021 evaluation campaign. Our team took part in the last two proposed tasks (T2 and T3), both concerned with computing semantic similarity between short texts, and ranked first on both. This edition introduced two new tasks for the automatic evaluation of student answers to teacher questions. The corpus consisted of about a hundred computer science exercise statements, each with the teacher's reference correction and, on average, the answers of about fifty students per question, collected over two years. Task 2 consisted of evaluating student answers against the correction produced by the teacher, and Task 3 of evaluating student answers given a set composed of an exercise statement and several student answers already graded by the teacher.
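
As a simple baseline for these tasks, not the team's winning system, one can score each student answer by TF-IDF cosine similarity to the teacher's reference correction:

```python
# Illustrative baseline for DEFT-style answer grading: TF-IDF cosine
# similarity between each student answer and the teacher's reference.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def score_answers(reference: str, student_answers: list[str]) -> list[float]:
    """Return one similarity score in [0, 1] per student answer."""
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform([reference] + student_answers)
    return cosine_similarity(matrix[0], matrix[1:])[0].tolist()

# Example:
# score_answers("Une pile suit une discipline LIFO.",
#               ["La pile est LIFO.", "C'est une structure FIFO."])
```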