Véronique Moriceau

Also published as: Veronique Moriceau

2021

“Be nice to your wife! The restaurants are closed”: Can Gender Stereotype Detection Improve Sexism Classification?
Patricia Chiril | Farah Benamara | Véronique Moriceau
Findings of the Association for Computational Linguistics: EMNLP 2021

In this paper, we focus on the detection of sexist hate speech against women in tweets, studying for the first time the impact of gender stereotype detection on sexism classification. We propose: (1) the first dataset annotated for gender stereotype detection, (2) a new method for data augmentation based on sentence similarity with multilingual external datasets, and (3) a set of deep learning experiments, first to detect gender stereotypes and then to use this auxiliary task for sexism detection. Although the presence of stereotypes does not necessarily entail hateful content, our results show that sexism classification can definitely benefit from gender stereotype detection.

2020

Social media networks have become a space where users are free to express their opinions and sentiments, which may lead to the wide spread of hateful or abusive messages that have to be moderated. This paper presents the first French corpus annotated for sexism detection, composed of about 12,000 tweets. In a context of offensive content mediation on social media now regulated by European laws, we think it is important not only to detect sexist content automatically but also to identify whether a message with sexist content is really sexist (i.e. addressed to a woman or describing a woman or women in general) or is a story of sexism experienced by a woman. This point is the novelty of our annotation scheme. We also report preliminary results for sexism detection obtained with a deep learning approach. Our experiments show encouraging results.

He said “who’s gonna take care of your children when you are at ACL?”: Reported Sexist Acts are Not Sexist
Patricia Chiril | Véronique Moriceau | Farah Benamara | Alda Mari | Gloria Origgi | Marlène Coulomb-Gully
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

In a context of offensive content mediation on social media now regulated by European laws, it is important not only to be able to automatically detect sexist content but also to identify whether a message with sexist content is really sexist or is a story of sexism experienced by a woman. We propose: (1) a new characterization of sexist content inspired by speech act theory and discourse analysis studies, (2) the first French dataset annotated for sexism detection, and (3) a set of deep learning experiments trained on top of a combination of several vector representations of tweets (word embeddings, linguistic features, and various generalization strategies). Our results are encouraging and constitute a first step towards offensive content moderation.

2019

Multilingual and Multitarget Hate Speech Detection in Tweets
Patricia Chiril | Farah Benamara Zitoune | Véronique Moriceau | Marlène Coulomb-Gully | Abhishek Kumar
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume II : Articles courts

Social media networks have become a space where users are free to express their opinions and sentiments, which may lead to the wide spread of hateful or abusive messages that have to be moderated. This paper proposes a supervised approach to hate speech detection from a multilingual perspective. We focus in particular on hateful messages towards two different targets (immigrants and women) in English tweets, as well as sexist messages in both English and French. Several models have been developed, ranging from feature-engineering approaches to neural ones. Our experiments show very encouraging results for both languages.

The binary trio at SemEval-2019 Task 5: Multitarget Hate Speech Detection in Tweets
Patricia Chiril | Farah Benamara Zitoune | Véronique Moriceau | Abhishek Kumar
Proceedings of the 13th International Workshop on Semantic Evaluation

The massive growth of user-generated web content through blogs, online forums and, most notably, social media networks has led to the wide spread of hateful or abusive messages that have to be moderated. This paper proposes a supervised approach to hate speech detection towards immigrants and women in English tweets. Several models have been developed, ranging from feature-engineering approaches to neural ones.

2017

Exploring the Impact of Pragmatic Phenomena on Irony Detection in Tweets: A Multilingual Corpus Study
Jihen Karoui | Farah Benamara | Véronique Moriceau | Viviana Patti | Cristina Bosco | Nathalie Aussenac-Gilles
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

This paper provides a linguistic and pragmatic analysis of the phenomenon of irony in order to represent how Twitter users exploit irony devices within their communication strategies for generating textual content. We aim to measure the impact of a wide range of pragmatic phenomena on the interpretation of irony, and to investigate how these phenomena interact with contexts local to the tweet. Informed by linguistic theories, we propose for the first time a multi-layered annotation schema for irony and its application to a corpus of French, English and Italian tweets. We detail each layer, explore their interactions, and discuss our results from both qualitative and quantitative perspectives.

2016

LIMSI at SemEval-2016 Task 12: machine-learning and temporal information to identify clinical events and time expressions
Cyril Grouin | Véronique Moriceau
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

2015

Identification de facteurs de risque pour des patients diabétiques à partir de comptes-rendus cliniques par des approches hybrides
Cyril Grouin | Véronique Moriceau | Sophie Rosset | Pierre Zweigenbaum
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

In this paper, we present the methods we developed to analyze hospital records written in English. The aim of this study is to identify risk factors for death in diabetic patients and to position the medical events described relative to the creation date of each document. Our approach relies on (i) HeidelTime to identify temporal expressions, (ii) CRFs supplemented with post-processing rules to identify treatments, diseases and risk factors, and (iii) rules to position each medical event in time. On a corpus of 514 documents, we obtain an overall F-measure of 0.8451. We observe that identifying information directly mentioned in the documents performs better than inferring information from laboratory results.

Détection automatique de l’ironie dans les tweets en français
Jihen Karoui | Farah Benamara Zitoune | Véronique Moriceau | Nathalie Aussenac-Gilles | Lamia Hadrich Belguith
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

This paper presents a supervised learning method for irony detection in French tweets. A binary classifier uses state-of-the-art features whose performance is well established, along with new features derived from our corpus study. In particular, we focus on negation and on explicit/implicit oppositions between opinion expressions with different polarities. The results obtained are encouraging.

Médicaments qui soignent, médicaments qui rendent malades : étude des relations causales pour identifier les effets secondaires
François Morlane-Hondère | Cyril Grouin | Véronique Moriceau | Pierre Zweigenbaum
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

In this paper, we examine how the links between a medical treatment and a side effect are expressed. Because patients turn to the internet first, we base this study on an annotated corpus of messages from French-language health forums. The aim of this work is to highlight linguistic elements (logical connectives and temporal expressions) that could be useful for systems that automatically detect side effects. We observe that the way people write on forums makes it impossible to rely on temporal expressions. Logical connectives, by contrast, appear useful for identifying side effects.

Towards a Contextual Pragmatic Model to Detect Irony in Tweets
Jihen Karoui | Farah Benamara Zitoune | Véronique Moriceau | Nathalie Aussenac-Gilles | Lamia Hadrich Belguith
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

2014

French Resources for Extraction and Normalization of Temporal Expressions with HeidelTime
Véronique Moriceau | Xavier Tannier
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper, we describe the development of French resources for the extraction and normalization of temporal expressions with HeidelTime, an open-source multilingual, cross-domain temporal tagger. HeidelTime extracts temporal expressions from documents and normalizes them according to the TIMEX3 annotation standard. Several types of temporal expressions are extracted: dates, times, durations and temporal sets. The French resources have been evaluated in two different ways: on the French TimeBank corpus, a corpus of newspaper articles in French annotated according to the ISO-TimeML standard, and on a user application for automatic building of event timelines. Results on the French TimeBank are quite satisfying, as they are comparable to those obtained by HeidelTime in English and Spanish on newswire articles. For the user application, we used two temporal taggers to preprocess the corpus in order to compare their performance, and the results show that our application performs better on French documents with HeidelTime. The French resources and evaluation scripts are publicly available with HeidelTime.

Fine-grained semantic categorization of opinion expressions for consensus detection (Catégorisation sémantique fine des expressions d’opinion pour la détection de consensus) [in French]
Farah Benamara | Véronique Moriceau | Yvette Yannick Mathieu
TALN-RECITAL 2014 Workshop DEFT 2014 : DÉfi Fouille de Textes (DEFT 2014 Workshop: Text Mining Challenge)

Ranking Multidocument Event Descriptions for Building Thematic Timelines
Kiem-Hieu Nguyen | Xavier Tannier | Veronique Moriceau
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

User evaluation of a multiple answer extraction system on the Web (Évaluation d’un système d’extraction de réponses multiples sur le Web par comparaison à des humains) [in French]
Mathieu-Henri Falco | Véronique Moriceau | Anne Vilnat
Proceedings of TALN 2014 (Volume 2: Short Papers)

2013

Building Event Threads out of Multiple News Articles
Xavier Tannier | Véronique Moriceau
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

An Interface for Validating and Evaluating Thematic Timelines (Une interface pour la validation et l’évaluation de chronologies thématiques) [in French]
Xavier Tannier | Véronique Moriceau | Erwan Le Flem
Proceedings of TALN 2013 (Volume 3: System Demonstrations)

2012

Finding Salient Dates for Building Thematic Timelines
Rémy Kessler | Xavier Tannier | Caroline Hagège | Véronique Moriceau | André Bittar
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Temporal Annotation: A Proposal for Guidelines and an Experiment with Inter-annotator Agreement
André Bittar | Caroline Hagège | Véronique Moriceau | Xavier Tannier | Charles Teissèdre
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This article presents work carried out within the framework of the ongoing ANR (French National Research Agency) project Chronolines, which focuses on the temporal processing of large news-wire corpora in English and French. The aim of the project is to create new and innovative interfaces for visualizing textual content according to temporal criteria. Extracting and normalizing the temporal information in texts through linguistic annotation is an essential step towards attaining this objective. With this goal in mind, we developed a set of guidelines for the annotation of temporal and event expressions that is intended to be compatible with the TimeML markup language, while addressing some of its pitfalls. We provide results of an initial application of these guidelines to real news-wire texts in French over several iterations of the annotation process. These results include inter-annotator agreement figures and an error analysis. Our final inter-annotator agreement figures compare favorably with those reported for the TimeBank 1.2 annotation project.

Evolution of Event Designation in Media: Preliminary Study
Xavier Tannier | Véronique Moriceau | Béatrice Arnulphy | Ruixin He
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Within the general purpose of information extraction, detection of event descriptions is often an important clue. An important characteristic of event designation in texts, and especially in media, is that it changes over time. Understanding how these designations evolve is important in information retrieval and information extraction. Our first hypothesis is that, when an event first occurs, media relate it in a very descriptive way (using verbal designations), whereas after some time they use shorter nominal designations instead. Our second hypothesis is that the number of different nominal designations for an event tends to stabilize over time. In this article, we present our methodology for studying the evolution of event designations in French documents from the news agency AFP. For this preliminary study, we focused on 7 topics which have been relatively important in France. Verbal and nominal designations of events have been manually annotated in manually selected topic-related passages. This French corpus contains a total of 2064 annotations. We then provide interesting preliminary statistical results and observations concerning these evolutions.

Kitten: a tool for normalizing HTML and extracting its textual content
Mathieu-Henri Falco | Véronique Moriceau | Anne Vilnat
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The web is composed of a gigantic number of documents that can be very useful for information extraction systems. Most of them are written in HTML and have to be rendered by an HTML engine in order to display the data they contain on a screen. HTML files thus mix informational and rendering content. Our goal is to design a tool for informational content extraction. A linear extraction with only a basic filtering of rendering content would not be enough, as objects such as lists and tables are linearly coded but need to be read in a non-linear way to be interpreted correctly. Besides, these HTML pages are often incorrectly coded from an HTML point of view and use a segmentation of blocks based on blank space that cannot be transposed to a text file without confusing syntactic parsers. For this purpose, we propose the Kitten tool, which first normalizes an HTML file into a Unicode XHTML file, then extracts the informational content into a text file with special processing for sentences, lists and tables.

2011

Génération automatique de questions à partir de textes en français (Automatic generation of questions from texts in French)
Louis de Viron | Delphine Bernhard | Véronique Moriceau | Xavier Tannier
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

In this paper, we present an automatic question generator for French. The generation system transforms declarative sentences into interrogative ones and relies on a prior syntactic analysis of the source sentence. We describe the different types of questions generated. We also present an evaluation of the tool, which shows that 41% of the questions generated by the system are perfectly well formed.

2010

FIDJI: Web Question-Answering at Quaero 2009
Xavier Tannier | Véronique Moriceau
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper presents the participation of the FIDJI system in the Web Question-Answering evaluation campaign organized by Quaero in 2009. FIDJI is an open-domain question-answering system which combines syntactic information with traditional QA techniques such as named entity recognition and term weighting in order to validate answers through multiple documents. It was originally designed to process “clean” document collections. Overall results are significantly lower than in traditional campaigns, but results (for the French evaluation) are quite good compared to other state-of-the-art systems. They show that a syntax-based strategy, applied to uncleaned Web data, can still obtain good results. Moreover, we obtain much higher scores on “complex” questions, i.e. ‘how’ and ‘why’ questions, which are more representative of real user needs. These results show that questioning the Web with advanced linguistic techniques can be done without heavy pre-processing and with results that come close to those of the best systems that use strong resources and large structured indexes.

In the QA and information retrieval domains, progress has been assessed via evaluation campaigns (Clef, Ntcir, Equer, Trec). In these evaluations, the systems handle independent questions and should provide one answer to each question, extracted from textual data, for both open domain and restricted domain. Quæro is a program promoting research and industrial innovation on technologies for automatic analysis and classification of multimedia and multilingual documents. Among the many research areas covered by Quæro, the project organized a series of evaluations of Question Answering on Web Data systems in 2008 and 2009. For each language (English and French), the full corpus has a size of around 20Gb for 2.5M documents. We describe the task and corpora, and especially the methodology used in 2008 to construct the test set of questions and the new one used in the 2009 campaign. Six types of questions were addressed: factual, non-factual (How, Why, What), List, and Boolean. A description of the participating systems and the obtained results is provided. We show how difficult it is for a question-answering system to work with complex data and questions.

Question answering (QA) systems aim at retrieving precise information from a large collection of documents. To be considered reliable by users, a QA system must provide elements to evaluate the answer. This notion of answer justification can also be useful when developing a QA system, in order to give criteria for selecting correct answers. An answer justification can be found in a sentence, in a passage made of several consecutive sentences, in several passages of a document, or in several documents. Thus, we are interested in pinpointing the set of information that makes it possible to verify the correctness of the answer in a candidate passage, and the question elements that are missing from this passage. Moreover, the relevant information is often given in texts in a form different from that of the question: anaphora, paraphrases, synonyms. In order to have a better idea of the importance of all the phenomena we underlined, and to put enough examples at QA developers' disposal to study them, we decided to build an annotated corpus.

Une étude des questions “complexes” en question-réponse
Véronique Moriceau | Xavier Tannier | Mathieu Falco
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

Most question-answering systems have been designed to answer so-called “factoid” questions (precise answers such as dates or places), and few have addressed the processing of complex questions. This paper presents a typology of questions that includes complex questions, as well as a typology of the answer forms expected for each type of question. We also present preliminary experiments using these typologies for complex questions, with good results.

2009

Apport de la syntaxe dans un système de question-réponse : étude du système FIDJI.
Véronique Moriceau | Xavier Tannier
Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

This paper presents a series of evaluations studying the contribution of a robust syntactic analysis of questions and documents in a question-answering system. These evaluations were carried out on the FIDJI system, which uses both syntactic information and more “traditional” techniques. Document selection, answer extraction, and behavior across the different question types were studied.

2006

Language Challenges for Data Fusion in Question-Answering
Véronique Moriceau
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Search engines on the web and most existing question-answering systems provide the user with a set of hyperlinks and/or web page extracts containing answer(s) to a question. These answers are often incoherent to a certain degree (equivalent, contradictory, etc.). It is then quite difficult for the user to know which answer is the correct one. In this paper, we present an approach which aims at providing synthetic numerical answers in a question-answering system. These answers are generated in natural language and, in a cooperative perspective, the aim is to explain to the user the variation of numerical values when several values, apparently incoherent, are extracted from the web as possible answers to a question. We present in particular how lexical resources are essential to answer extraction from the web, to the characterization of the variation mode associated with the type of information and to answer generation in natural language.

Generating Intelligent Numerical Answers in a Question-Answering System
Véronique Moriceau
Proceedings of the Fourth International Natural Language Generation Conference

Numerical Data Integration for Cooperative Question-Answering
Véronique Moriceau
Proceedings of the Workshop KRAQ’06: Knowledge and Reasoning for Language Processing