Caroline Brun

2024

FrenchToxicityPrompts: a Large Benchmark for Evaluating and Mitigating Toxicity in French Texts
Caroline Brun | Vassilina Nikoulina
Proceedings of the Fourth Workshop on Threat, Aggression & Cyberbullying @ LREC-COLING-2024

Large language models (LLMs) are increasingly popular but are also prone to generating biased, toxic, or harmful language, which can have detrimental effects on individuals and communities. Although most effort to assess and mitigate toxicity in generated content is concentrated on English, it is essential to consider other languages as well. To address this issue, we create and release FrenchToxicityPrompts, a dataset of 50K naturally occurring French prompts and their continuations, annotated with toxicity scores from a widely used toxicity classifier. We evaluate 14 different models from four prevalent open-source families of LLMs against our dataset to assess their potential toxicity across various dimensions. We hope that our contribution will foster future research on toxicity detection and mitigation beyond English.

2023

Data augmentation is an important method for evaluating the robustness of and enhancing the diversity of training data for natural language processing (NLP) models. In this paper, we present NL-Augmenter, a new participatory Python-based natural language (NL) augmentation framework which supports the creation of transformations (modifications to the data) and filters (data splits according to specific features). We describe the framework and an initial set of 117 transformations and 23 filters for a variety of NL tasks annotated with noisy descriptive tags. The transformations incorporate noise, intentional and accidental human mistakes, socio-linguistic variation, semantically-valid style, syntax changes, as well as artificial constructs that are unambiguous to humans. We demonstrate the efficacy of NL-Augmenter by using its transformations to analyze the robustness of popular language models. We find different models to be differently challenged on different tasks, with quasi-systematic score decreases. The infrastructure, datacards, and robustness evaluation results are publicly available on GitHub for the benefit of researchers working on paraphrase generation, robustness analysis, and low-resource NLP.
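
As a rough illustration of the two concepts named in this abstract, the sketch below defines one transformation (a modification of the data) and one filter (a data split on a specific feature) in plain Python. The class names, the `generate` and `keep` methods, and the typo-noise rule are hypothetical and chosen for illustration; they are not NL-Augmenter's actual interface.

```python
import random


class SwapAdjacentChars:
    """Hypothetical transformation: typo-like noise that swaps two adjacent characters."""

    def __init__(self, seed: int = 0) -> None:
        self.rng = random.Random(seed)

    def generate(self, sentence: str) -> str:
        # Nothing to swap in strings shorter than two characters.
        if len(sentence) < 2:
            return sentence
        i = self.rng.randrange(len(sentence) - 1)
        chars = list(sentence)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
        return "".join(chars)


class MinLengthFilter:
    """Hypothetical filter: keeps sentences with at least `min_tokens` whitespace tokens."""

    def __init__(self, min_tokens: int) -> None:
        self.min_tokens = min_tokens

    def keep(self, sentence: str) -> bool:
        return len(sentence.split()) >= self.min_tokens


sentences = ["Short one.", "This sentence is long enough to pass the filter."]
length_filter = MinLengthFilter(min_tokens=5)

# Filter first (select a data slice), then perturb the selected slice.
selected = [s for s in sentences if length_filter.keep(s)]
perturbed = [SwapAdjacentChars(seed=0).generate(s) for s in selected]
```

Scoring a model on `selected` and on `perturbed` and comparing the two numbers is one simple way to probe robustness to this kind of noise.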

2022

SMaLL-100: Introducing Shallow Multilingual Machine Translation Model for Low-Resource Languages
Alireza Mohammadshahi | Vassilina Nikoulina | Alexandre Berard | Caroline Brun | James Henderson | Laurent Besacier
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

In recent years, multilingual machine translation models have achieved promising performance on low-resource language pairs by sharing information between similar languages, thus enabling zero-shot translation. To overcome the “curse of multilinguality”, these models often opt for scaling up the number of parameters, which makes their use in resource-constrained environments challenging. We introduce SMaLL-100, a distilled version of the M2M-100 (12B) model, a massively multilingual machine translation model covering 100 languages. We train SMaLL-100 with uniform sampling across all language pairs and therefore focus on preserving the performance of low-resource languages. We evaluate SMaLL-100 on different low-resource benchmarks (FLORES-101, Tatoeba, and TICO-19) and demonstrate that it outperforms previous massively multilingual models of comparable sizes (200-600M) while improving inference latency and memory usage. Additionally, our model achieves comparable results to M2M-100 (1.2B), while being 3.6x smaller and 4.3x faster at inference.
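
To make the effect of uniform sampling concrete, the following minimal sketch contrasts it with sampling proportional to corpus size. The language pairs and sentence counts are hypothetical placeholders, not figures from the paper, and the code is not the authors' training pipeline.

```python
import random

# Hypothetical sentence-pair counts per language pair (illustrative only).
corpus_sizes = {"en-fr": 1_000_000, "en-sw": 10_000, "fr-wo": 1_000}


def proportional_probs(sizes: dict[str, int]) -> dict[str, float]:
    """Sampling probability proportional to the amount of data per pair."""
    total = sum(sizes.values())
    return {pair: n / total for pair, n in sizes.items()}


def uniform_probs(sizes: dict[str, int]) -> dict[str, float]:
    """Every language pair is equally likely, regardless of data size."""
    return {pair: 1 / len(sizes) for pair in sizes}


proportional = proportional_probs(corpus_sizes)
uniform = uniform_probs(corpus_sizes)

# Draw the language pair for each of 10 training batches under uniform sampling.
rng = random.Random(0)
pairs = list(uniform)
batch_pairs = rng.choices(pairs, weights=[uniform[p] for p in pairs], k=10)
```

Under proportional sampling the smallest pair here would be drawn roughly once per thousand batches, whereas uniform sampling draws it a third of the time, which is why uniform sampling favors low-resource pairs.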

What Do Compressed Multilingual Machine Translation Models Forget?
Alireza Mohammadshahi | Vassilina Nikoulina | Alexandre Berard | Caroline Brun | James Henderson | Laurent Besacier
Findings of the Association for Computational Linguistics: EMNLP 2022

Recently, very large pre-trained models have achieved state-of-the-art results in various natural language processing (NLP) tasks, but their size makes it more challenging to apply them in resource-constrained environments. Compression techniques make it possible to drastically reduce the size of the models, and therefore their inference time, with negligible impact on top-tier metrics. However, the general performance averaged across multiple tasks and/or languages may hide a drastic performance drop on under-represented features, which could result in the amplification of biases encoded by the models. In this work, we assess the impact of compression methods on Multilingual Neural Machine Translation (MNMT) models for various language groups, gender, and semantic biases by extensive analysis of compressed models on different machine translation benchmarks, i.e. FLORES-101, MT-Gender, and DiBiMT. We show that the performance of under-represented languages drops significantly, while the average BLEU metric only slightly decreases. Interestingly, the removal of noisy memorization with compression leads to a significant improvement for some medium-resource languages. Finally, we demonstrate that compression amplifies intrinsic gender and semantic biases, even in high-resource languages.
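
The point that an averaged metric can hide a large drop on one language can be checked with a few lines of arithmetic. The BLEU scores below are invented for illustration and are not results from the paper; the language codes are likewise placeholders.

```python
from statistics import mean

# Hypothetical BLEU scores before and after compression (illustrative only).
baseline = {"fr": 40.0, "de": 35.0, "es": 38.0, "sw": 12.0}
compressed = {"fr": 40.0, "de": 35.0, "es": 38.0, "sw": 6.0}

# Drop in the average score across languages.
avg_drop = mean(baseline.values()) - mean(compressed.values())

# Relative drop per language, which exposes what the average hides.
relative_drop = {
    lang: (baseline[lang] - compressed[lang]) / baseline[lang] for lang in baseline
}
worst_lang = max(relative_drop, key=relative_drop.get)

print(f"average BLEU drop: {avg_drop:.2f}")
print(f"largest relative drop: {worst_lang} ({relative_drop[worst_lang]:.0%})")
```

In this toy setting the average moves by 1.5 BLEU while one language loses half its score, so reporting per-language or per-group results alongside the average is the safer practice.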

2021

Semantic Context Path Labeling for Semantic Exploration of User Reviews
Salah Aït-Mokhtar | Caroline Brun | Yves Hoppenot | Agnes Sandor
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

In this paper we present a prototype demonstrator showcasing a novel method to perform semantic exploration of user reviews. The system enables effective navigation in a rich contextual semantic schema with a large number of structured classes indicating relevant information. In order to identify instances of the structured classes in the reviews, we defined a new Information Extraction task called Semantic Context Path (SCP) labeling, which simultaneously assigns types and semantic roles to entity mentions. Reviews can rapidly be explored based on the fine-grained and structured semantic classes. As a proof-of-concept, we have implemented this system for reviews on Points-of-Interest, in English and Korean.

2019

“Sentiment Aware Map” : exploration cartographique de points d’intérêt via l’analyse de sentiments au niveau des aspects
Ioan Calapodescu | Caroline Brun | Vassilina Nikoulina | Salah Aït-Mokhtar
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume IV : Démonstrations

2018

Transfert de ressources sémantiques pour l’analyse de sentiments au niveau des aspects
Caroline Brun
Actes de la Conférence TALN. Volume 1 - Articles longs, articles courts de TALN

In this paper, we address the problem of polarity detection for aspect-based sentiment analysis in a bilingual setting: we propose to adapt the polarity detection component of an existing aspect-based sentiment analysis system, which performs very well on the task and relies on rich semantic resources for a given language, to a language with fewer semantic resources. The underlying idea is to reduce the supervision needed to build the semantic resources that are essential to our system. To this end, the low-resource source language is translated into the target language, and the parallel translations are then word-aligned. The rich semantic information is then extracted from the target language by the polarity detection system, and this information is aligned back to the source language. We present the different steps of this experiment, as well as the final evaluation. We conclude with some perspectives.

Aspect Based Sentiment Analysis into the Wild
Caroline Brun | Vassilina Nikoulina
Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

In this paper, we test state-of-the-art Aspect Based Sentiment Analysis (ABSA) systems, trained on a widely used dataset, on actual data. We created a new manually annotated dataset of user-generated data from the same domain as the training dataset, but from other sources, and analyse the differences between the new and the standard ABSA dataset. We then analyse the performance of different versions of the same system on both datasets. We also propose light adaptation methods to increase system robustness.

2016

XRCE at SemEval-2016 Task 5: Feedbacked Ensemble Modeling on Syntactico-Semantic Knowledge for Aspect Based Sentiment Analysis
Caroline Brun | Julien Perez | Claude Roux
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

Steps Toward Automatic Understanding of the Function of Affective Language in Support Groups
Amit Navindgi | Caroline Brun | Cécile Boulard Masson | Scott Nowson
Proceedings of the Fourth International Workshop on Natural Language Processing for Social Media

2015

Un système hybride pour l’analyse de sentiments associés aux aspects
Caroline Brun | Diana Nicoleta Popa | Claude Roux
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

This article presents in detail our participation in Task 4 of SemEval 2014 (Aspect Based Sentiment Analysis). We present the task and describe precisely our system, which combines linguistic components and classification modules. We then report its evaluation results, as well as the results of the best systems. We conclude by presenting some new experiments carried out to improve this system.

Etude de l’image de marque d’entités dans le cadre d’une plateforme de veille sur le Web social
Leila Khouas | Caroline Brun | Anne Peradotto | Jean-Valère Cossu | Julien Boyadjian | Julien Velcin
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Démonstrations

This work concerns the integration, into a web monitoring platform, of tools for analyzing the opinions expressed by internet users about an entity, as well as the way these opinions evolve over time. The entities considered can be people, companies, brands, etc. The tools implemented are the result of a collaboration involving several industrial and academic partners within the ANR ImagiWeb project.

Motivating Personality-aware Machine Translation
Shachar Mirkin | Scott Nowson | Caroline Brun | Julien Perez
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

2014

The objective of this paper is to describe the design of a dataset that deals with the image (i.e., representation, web reputation) of various entities populating the Internet: politicians, celebrities, companies, brands, etc. Our main contribution is to build and provide an original annotated French dataset. This dataset consists of 11527 manually annotated tweets expressing opinions on specific facets (e.g., ethics, communication, economic project) describing two French politicians over time. We believe that other researchers might benefit from this experience, since designing and implementing such a dataset has proven quite an interesting challenge. This design comprises different processes such as data selection, formal definition and instantiation of an image. We have set up a full open-source annotation platform. In addition to the dataset design, we present the first results that we obtained by applying clustering methods to the annotated dataset in order to extract the entity images.

XRCE: Hybrid Classification for Aspect-based Sentiment Analysis
Caroline Brun | Diana Nicoleta Popa | Claude Roux
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

Part of Speech Tagging for French Social Media Data
Farhad Nooralahzadeh | Caroline Brun | Claude Roux
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

Decomposing Hashtags to Improve Tweet Polarity Classification (Décomposition des « hash tags » pour l’amélioration de la classification en polarité des « tweets ») [in French]
Caroline Brun | Claude Roux
Proceedings of TALN 2014 (Volume 2: Short Papers)

2012

A Graphical User Interface for Feature-Based Opinion Mining
Pedro Paulo Balage Filho | Caroline Brun | Gilbert Rondeau
Proceedings of the Demonstration Session at the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Propagation de polarités dans des familles de mots : impact de la morphologie dans la construction d’un lexique pour l’analyse de sentiments (Spreading Polarities among Word Families: Impact of Morphology on Building a Lexicon for Sentiment Analysis) [in French]
Núria Gala | Caroline Brun
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 2: TALN

Opinion and Suggestion Analysis for Expert Recommendations
Anna Stavrianou | Caroline Brun
Proceedings of the Workshop on Semantic Analysis in Social Media

Linguistically-Adapted Structural Query Annotation for Digital Libraries in the Social Sciences
Caroline Brun | Vassilina Nikoulina | Nikolaos Lagos
Proceedings of the 6th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

Learning Opinionated Patterns for Contextual Opinion Detection
Caroline Brun
Proceedings of COLING 2012: Posters

2011

Un système de détection d’opinions fondé sur l’analyse syntaxique profonde (An opinion detection system based on deep syntactic analysis)
Caroline Brun
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

In this paper, we present an opinion detection system built on the output of a robust syntactic parser that produces deep analyses. The goal of this system is to extract opinions associated with products (the main concepts) as well as with the concepts related to them (feature-based opinion extraction). Following a study of a target corpus, our parser is enriched by adding polarity to the relevant lexical items and by developing generic and specialized rules for extracting semantic opinion relations, which are intended to feed an opinion representation model. A first evaluation shows very encouraging results, but many perspectives remain to be explored.

Detecting Opinions Using Deep Syntactic Analysis
Caroline Brun
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

2010

Un système de détection d’entités nommées adapté pour la campagne d’évaluation ESTER 2
Caroline Brun | Maud Ehrmann
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

In this paper, we report on our participation in the ESTER 2 evaluation campaign (Evaluation des Systèmes de Transcription Enrichie d’Emissions Radiophoniques). After describing the goals of this campaign as well as its specific features and difficulties, we present our named entity extraction system, focusing on the adaptations made for this campaign. We then describe the results obtained during the competition, as well as original results obtained afterwards. We conclude with the lessons learned from this experience.

2009

We present an experiment in merging named entity annotations produced by different annotators. This work was carried out within the Infom@gic project, which aims at integrating and validating operational applications around knowledge engineering and information analysis, and which is supported by the Cap Digital competitiveness cluster « Image, MultiMédia et Vie Numérique ». We first describe the four named entity annotators underlying this experiment. Each of them provides entity annotations that conform to a standard developed within the Infom@gic project. The annotation merging algorithm is then presented; it handles compatibility between annotations and highlights conflicts, thereby providing more reliable information. We conclude by presenting and interpreting the results of the merge, obtained on a manually annotated reference corpus.

Résolution de métonymie des entités nommées : proposition d’une méthode hybride [Metonymy resolution for named entities: an hybrid approach]
Caroline Brun | Maud Ehrmann | Guillaume Jacquet
Traitement Automatique des Langues, Volume 50, Numéro 1 : Varia [Varia]

2008

Vérification sémantique pour l’annotation d’entités nommées
Caroline Brun | Caroline Hagège
Actes de la 15ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

In this paper, we propose a method for correcting semantic types and dynamically assigning new ones in automatic named entity (NE) detection systems. After named entities, and more generally proper nouns, have been detected in texts, a semantic type compatibility check is performed, not only to confirm or correct the results of the NE detection system, but also to assign new types not covered by it. This check uses the syntactic information associated with the NEs by a robust syntactic parser and compares these results against the WordNet semantic resource. The results of the NE detection system are thereby considerably enriched, as are the semantic labels associated with the NEs, which is particularly useful for adapting NE detection systems to new domains.

Résolution de Métonymie des Entités Nommées : proposition d’une méthode hybride
Caroline Brun | Maud Ehrmann | Guillaume Jacquet
Actes de la 15ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

In this paper, we describe the method we developed for named entity metonymy resolution in the SemEval 2007 competition. To resolve metonymies on location names and organization names, as required for this task, we designed a hybrid system based on a robust syntactic parser combined with a distributional analysis method. We describe this method as well as the results obtained by the system in the SemEval 2007 competition.

2007

XRCE-M: A Hybrid System for Named Entity Metonymy Resolution
Caroline Brun | Maud Ehrmann | Guillaume Jacquet
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

2004

Extraction d’information en domaine restreint pour la génération multilingue de résumés ciblés
Caroline Brun | Caroline Hagège
Actes de la 11ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

In this paper, we present an application that generates targeted multilingual summaries from texts in a restricted domain. These summaries are called targeted because they are produced according to the specifications of a user, who must decide beforehand what type of information should appear in the final summary. To carry out this task, we first extract the information specified by the user. This information is the input to a multilingual generation system that produces normalized summaries in three languages (English, French, and Spanish) from an English text.

2003

Controlled Authoring of Biological Experiment Reports
Caroline Brun | Marc Dymetman | Eric Fanchon | Stanislas Lhomme
Demonstrations

Normalization and Paraphrasing Using Symbolic Methods
Caroline Brun | Caroline Hagège
Proceedings of the Second International Workshop on Paraphrasing

MDA-XML : une expérience de rédaction contrôlée multilingue basée sur XML
Guy Lapalme | Caroline Brun | Marc Dymetman
Actes de la 10ème conférence sur le Traitement Automatique des Langues Naturelles. Posters

In this paper, we describe the implementation of a multilingual controlled authoring system in an XML environment. With this system, an author interactively writes a text that conforms to well-formedness rules, at the levels of semantic content and linguistic realization, described by an XML schema. We discuss the advantages of this approach as well as the difficulties encountered while developing this system. We conclude with an example application to a class of pharmaceutical documents.