2024
Cloaked Classifiers: Pseudonymization Strategies on Sensitive Classification Tasks
Arij Riabi | Menel Mahamdi | Virginie Mouilleron | Djamé Seddah
Proceedings of the Fifth Workshop on Privacy in Natural Language Processing
Protecting privacy is essential when sharing data, particularly in the case of an online radicalization dataset that may contain personal information. In this paper, we explore the balance between preserving data usefulness and ensuring robust privacy safeguards, since regulations like the European GDPR shape how personal information must be handled. We share our method for manually pseudonymizing a multilingual radicalization dataset, ensuring performance comparable to the original data. Furthermore, we highlight the importance of establishing comprehensive guidelines for processing sensitive NLP data by sharing our complete pseudonymization process, our guidelines, the challenges we encountered as well as the resulting dataset.
2023
Towards a Robust Detection of Language Model-Generated Text: Is ChatGPT that easy to detect?
Wissam Antoun | Virginie Mouilleron | Benoît Sagot | Djamé Seddah
Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : travaux de recherche originaux -- articles longs
Recent advances in natural language processing (NLP) have led to the development of large language models (LLMs) such as ChatGPT. This paper proposes a methodology for developing and evaluating ChatGPT detectors for French text, with a focus on investigating their robustness on out-of-domain data and against common attack schemes. The proposed method involves translating an English dataset into French and training a classifier on the translated data. Results show that the detectors can effectively detect ChatGPT-generated text, with a degree of robustness against basic attack techniques in in-domain settings. However, vulnerabilities are evident in out-of-domain contexts, highlighting the challenge of detecting adversarial text. The study emphasizes caution when applying in-domain testing results to a wider variety of content. We provide our translated datasets and models as open-source resources.
2020
The FinSim 2020 Shared Task: Learning Semantic Representations for the Financial Domain
Ismail El Maarouf | Youness Mansar | Virginie Mouilleron | Dialekti Valsamou-Stanislawski
Proceedings of the Second Workshop on Financial Technology and Natural Language Processing
The Financial Document Structure Extraction Shared task (FinToc 2020)
Najah-Imane Bentabet | Rémi Juge | Ismail El Maarouf | Virginie Mouilleron | Dialekti Valsamou-Stanislawski | Mahmoud El-Haj
Proceedings of the 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation
This paper presents the FinTOC-2020 Shared Task on structure extraction from financial documents, its participants' results, and their findings. This shared task was organized as part of the 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation (FNP-FNS 2020), held at the 28th International Conference on Computational Linguistics (COLING’2020). It aimed to stimulate research in systems that extract a table of contents (TOC) from investment documents (such as financial prospectuses) by detecting the document titles and organizing them hierarchically into a TOC. For the second edition of this shared task, two subtasks were presented to the participants: one with English documents and the other with French documents.
2016
Radarly : écouter et analyser le web conversationnel en temps réel (Real time listening and analysis of the social web using Radarly)
Jade Copet | Christine de Carvalho | Virginie Mouilleron | Benoit Tabutiaux | Hugo Zanghi
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 5 : Démonstrations
Given the context of online conversation, the Radarly tool was designed to process large volumes of heterogeneous data in real time, to generate new indicators, and to visualize them in a coherent and comfortable interface so that relevant analyses and studies can be drawn from them. This document presents the techniques and processes used to extract and process all of this data.
2013
Dynamic extension of a French morphological lexicon based a text stream (Extension dynamique de lexiques morphologiques pour le français à partir d’un flux textuel) [in French]
Benoît Sagot | Damien Nouvel | Virginie Mouilleron | Marion Baranes
Proceedings of TALN 2013 (Volume 1: Long Papers)
2012
The French Social Media Bank: a Treebank of Noisy User Generated Content
Djamé Seddah | Benoit Sagot | Marie Candito | Virginie Mouilleron | Vanessa Combet
Proceedings of COLING 2012