Virginie Mouilleron

2025

pdf bib abs
Beyond Dataset Creation: Critical View of Annotation Variation and Bias Probing of a Dataset for Online Radical Content Detection
Arij Riabi | Virginie Mouilleron | Menel Mahamdi | Wissam Antoun | Djamé Seddah
Proceedings of the 31st International Conference on Computational Linguistics

The proliferation of radical content on online platforms poses significant risks, including inciting violence and spreading extremist ideologies. Despite ongoing research, existing datasets and models often fail to address the complexities of multilingual and diverse data. To bridge this gap, we introduce a publicly available multilingual dataset annotated with radicalization levels, calls for action, and named entities in English, French, and Arabic. This dataset is pseudonymized to protect individual privacy while preserving contextual information. Beyond presenting our freely available dataset, we analyze the annotation process, highlighting biases and disagreements among annotators and their implications for model performance. Additionally, we use synthetic data to investigate the influence of socio-demographic traits on annotation patterns and model predictions. Our work offers a comprehensive examination of the challenges and opportunities in building robust datasets for radical content detection, emphasizing the importance of fairness and transparency in model development. The Counter dataset is available at https://gitlab.inria.fr/ariabi/counter-dataset-public.

2024

pdf bib abs
Cloaked Classifiers: Pseudonymization Strategies on Sensitive Classification Tasks
Arij Riabi | Menel Mahamdi | Virginie Mouilleron | Djamé Seddah
Proceedings of the Fifth Workshop on Privacy in Natural Language Processing

Protecting privacy is essential when sharing data, particularly in the case of an online radicalization dataset that may contain personal information. In this paper, we explore the balance between preserving data usefulness and ensuring robust privacy safeguards, since regulations like the European GDPR shape how personal information must be handled. We share our method for manually pseudonymizing a multilingual radicalization dataset, ensuring performance comparable to the original data. Furthermore, we highlight the importance of establishing comprehensive guidelines for processing sensitive NLP data by sharing our complete pseudonymization process, our guidelines, the challenges we encountered as well as the resulting dataset.

2023

pdf bib abs
Towards a Robust Detection of Language Model-Generated Text: Is ChatGPT that easy to detect?
Wissam Antoun | Virginie Mouilleron | Benoît Sagot | Djamé Seddah
Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : travaux de recherche originaux -- articles longs

Recent advances in natural language processing (NLP) have led to the development of large language models (LLMs) such as ChatGPT. This paper proposes a methodology for developing and evaluating ChatGPT detectors for French text, with a focus on investigating their robustness on out-of-domain data and against common attack schemes. The proposed method involves translating an English dataset into French and training a classifier on the translated data. Results show that the detectors can effectively detect ChatGPT-generated text, with a degree of robustness against basic attack techniques in in-domain settings. However, vulnerabilities are evident in out-of-domain contexts, highlighting the challenge of detecting adversarial text. The study emphasizes caution when applying in-domain testing results to a wider variety of content. We provide our translated datasets and models as open-source resources.

2020

pdf bib
The FinSim 2020 Shared Task: Learning Semantic Representations for the Financial Domain
Ismail El Maarouf | Youness Mansar | Virginie Mouilleron | Dialekti Valsamou-Stanislawski
Proceedings of the Second Workshop on Financial Technology and Natural Language Processing

pdf bib abs
The Financial Document Structure Extraction Shared task (FinToc 2020)
Najah-Imane Bentabet | Rémi Juge | Ismail El Maarouf | Virginie Mouilleron | Dialekti Valsamou-Stanislawski | Mahmoud El-Haj
Proceedings of the 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation

This paper presents the FinTOC-2020 Shared Task on structure extraction from financial documents, its participants results and their findings. This shared task was organized as part of The 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation (FNP-FNS 2020), held at The 28th International Conference on Computational Linguistics (COLING’2020). This shared task aimed to stimulate research in systems for extracting table-of-contents (TOC) from investment documents (such as financial prospectuses) by detecting the document titles and organizing them hierarchically into a TOC. For the second edition of this shared task, two subtasks were presented to the participants: one with English documents and the other one with French documents.

2016

pdf bib abs
Radarly : écouter et analyser le web conversationnel en temps réel (Real time listening and analysis of the social web using Radarly)
Jade Copet | Christine de Carvalho | Virginie Mouilleron | Benoit Tabutiaux | Hugo Zanghi
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 5 : Démonstrations

De par le contexte conversationnel digital, l’outil Radarly a été conçu pour permettre de traiter de grands volumes de données hétérogènes en temps réel, de générer de nouveaux indicateurs et de les visualiser sur une interface cohérente et confortable afin d’en tirer des analyses et études pertinentes. Ce document expose les techniques et processus utilisés pour extraire et traiter toutes ces données.