Sarah Valentin
2026
Data Augmentation Based on Selective Masking of Language Models for One Health Context
Youssef Mahdoubi | Najlae Idrissi | Mathieu Roche | Sarah Valentin
Proceedings of the 1st Workshop on Linguistic Analysis for Health (HeaLing 2026)
This study focuses on improving the performance of language models for two critical applications within the One Health context, specifically in epidemiological monitoring using textual data: (i) thematic classification across syndromic surveillance, biomedical and plant health domains, and (ii) detection of epidemic misinformation. A key challenge in these tasks is the limited availability of labeled textual data, which constrains the effectiveness of supervised learning methods. To overcome this limitation, we introduce two families of selective masking-based data augmentation strategies: lexical and non-lexical. Each family is implemented in a standard variant (Aug-SM-Lex and Aug-SM-NonLex) and a TF-IDF-weighted variant (Aug-SM-Lex-TFIDF and Aug-SM-NonLex-TFIDF). We perform two complementary experiments: the first determines the optimal masking rate, while the second evaluates the proposed strategies against LLM-based text reformulation. Experimental results indicate that selective masking-based augmentation outperformed both LLM-based reformulation (Mistral-7B and GPT-Neo-1.3B) and baseline models trained on original data alone on three of the five evaluated datasets, with the best performance achieved at a masking rate of 20%. This suggests that selective masking is a promising approach, potentially more effective than computationally expensive LLM-based reformulation.
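The idea of TF-IDF-weighted masking at a fixed rate can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`normalize`, `idf_table`, `mask_by_tfidf`), the whitespace tokenization, the smoothed IDF formula, and the toy corpus are all assumptions made for illustration. The sketch only selects which tokens to mask; it does not reproduce the lexical/non-lexical distinction of the paper, nor the step in which a language model fills the masked positions.

```python
import math
import re
from collections import Counter

MASK = "[MASK]"  # placeholder token; the actual mask token depends on the model used


def normalize(token):
    """Lowercase a token and strip non-word characters (illustrative choice)."""
    return re.sub(r"\W+", "", token).lower()


def idf_table(corpus):
    """Smoothed inverse document frequency for every normalized token."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update({normalize(t) for t in doc.split()} - {""})
    return {w: math.log((1 + n_docs) / (1 + c)) + 1.0 for w, c in doc_freq.items()}


def mask_by_tfidf(text, idf, rate=0.2, highest=True):
    """Mask a fraction `rate` of tokens, ranked by TF-IDF score.

    highest=True masks the highest-scoring tokens, highest=False the lowest.
    Ties are broken by token position.
    """
    tokens = text.split()
    keys = [normalize(t) for t in tokens]
    tf = Counter(k for k in keys if k)
    scored = [(tf[k] * idf.get(k, 0.0), i) for i, k in enumerate(keys) if k]
    n_mask = max(1, round(rate * len(scored))) if scored else 0
    scored.sort(key=lambda s: (-s[0], s[1]) if highest else (s[0], s[1]))
    chosen = {i for _, i in scored[:n_mask]}
    return " ".join(MASK if i in chosen else t for i, t in enumerate(tokens))


# Toy corpus, invented for this example only.
corpus = [
    "the outbreak of avian influenza was reported in a region",
    "the ministry reported new cases in the capital",
    "the vaccine campaign started in the north",
]
idf = idf_table(corpus)
high = mask_by_tfidf(corpus[0], idf, rate=0.2, highest=True)
low = mask_by_tfidf(corpus[0], idf, rate=0.2, highest=False)
print(high)
print(low)
```

With a 20% rate, two of the ten tokens in the first sentence are masked in each call. In a full augmentation pipeline the masked positions would then be filled by a masked language model to produce new training examples; that step is omitted here.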
2025
Analyse Textuelle et Extraction Géospatiale pour la Surveillance des Crises Alimentaires en Afrique de l’Ouest
Charles Abdoulaye Ngom | Maguelonne Teisseire | Sarah Valentin
Actes de la 20e Conférence en Recherche d’Information et Applications (CORIA)
West Africa faces recurrent food insecurity, exacerbated by conflict, climate change and economic shocks. Collecting information at an appropriate spatiotemporal scale is essential for monitoring food security crises. In this work, we focus on geospatial extraction from textual data, a task that is part of a broader approach to monitoring food crises from news articles. We evaluate two spatial entity extraction models, GLiNER and CamemBERT, in a zero-shot setting and after fine-tuning, on a corpus of 15,000 French-language news articles covering current events in Burkina Faso.
2020
Information retrieval for animal disease surveillance: a pattern-based approach.
Sarah Valentin | Mathieu Roche | Renaud Lancelot
Proceedings of the 11th International Workshop on Health Text Mining and Information Analysis
Animal disease-related news articles are rich in information useful for risk assessment. In this paper, we explore a method to automatically retrieve sentence-level epidemiological information. Our method is an incremental approach to create and expand patterns at both lexical and syntactic levels. Expert knowledge is used as input at different steps of the approach. Distributed vector representations (word embeddings) were used to expand the patterns at the lexical level, thus alleviating manual curation. We showed that expert validation was crucial to improve the precision of automatically generated patterns.
Automated Processing of Multilingual Online News for the Monitoring of Animal Infectious Diseases
Sarah Valentin | Renaud Lancelot | Mathieu Roche
Proceedings of the LREC 2020 Workshop on Multilingual Biomedical Text Processing (MultilingualBIO 2020)
The Platform for Automated extraction of animal Disease Information from the web (PADI-web) is an automated system that monitors the web to detect emerging animal infectious diseases. The tool automatically collects news via customised multilingual queries, classifies them and extracts epidemiological information. We detail the processing of multilingual online sources by PADI-web and analyse the translated outputs in a case study.