Guillaume Jacquet


Exploring Linguistically-Lightweight Keyword Extraction Techniques for Indexing News Articles in a Multilingual Set-up
Jakub Piskorski | Nicolas Stefanovitch | Guillaume Jacquet | Aldo Podavini
Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation

This paper presents a study of state-of-the-art unsupervised and linguistically unsophisticated keyword extraction algorithms, based on statistic-, graph-, and embedding-based approaches, including, i.a., Total Keyword Frequency, TF-IDF, RAKE, KPMiner, YAKE, KeyBERT, and variants of TextRank-based keyword extraction algorithms. The study was motivated by the need to select the most appropriate technique to extract keywords for indexing news articles in a real-world large-scale news analysis engine. The algorithms were evaluated on a corpus of circa 330 news articles in 7 languages. The overall best F1 scores for all languages on average were obtained using a combination of the recently introduced YAKE algorithm and KPMiner (20.1%, 46.6% and 47.2% for exact, partial and fuzzy matching resp.).

Fine-grained Event Classification in News-like Text Snippets - Shared Task 2, CASE 2021
Jacek Haneczok | Guillaume Jacquet | Jakub Piskorski | Nicolas Stefanovitch
Proceedings of the 4th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2021)

This paper describes the Shared Task on Fine-grained Event Classification in News-like Text Snippets. The Shared Task is divided into three sub-tasks: (a) classification of text snippets reporting socio-political events (25 classes) for which a vast amount of training data exists, although it differs in structure and style from the test data, (b) enhancement to a generalized zero-shot learning problem, where 3 additional event types were introduced in advance, but without any training data (‘unseen’ classes), and (c) a further extension, which introduced 2 additional event types, announced shortly prior to the evaluation phase. The reported Shared Task focuses on classification of events in English texts and is organized as part of the Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2021), co-located with the ACL-IJCNLP 2021 Conference. Four teams participated in the task. The best-performing systems for the three aforementioned sub-tasks achieved 83.9%, 79.7% and 77.1% weighted F1 scores respectively.


TF-IDF Character N-grams versus Word Embedding-based Models for Fine-grained Event Classification: A Preliminary Study
Jakub Piskorski | Guillaume Jacquet
Proceedings of the Workshop on Automated Extraction of Socio-political Events from News 2020

Automating the detection of event mentions in online texts and their classification vis-a-vis domain-specific event type taxonomies has been acknowledged by many organisations worldwide to be of paramount importance in order to facilitate the process of intelligence gathering. This paper reports on some preliminary experiments comparing various linguistically-lightweight approaches for fine-grained event classification based on short text snippets reporting on events. In particular, we compare the performance of a TF-IDF-weighted character n-gram SVM-based model versus SVMs trained on various off-the-shelf pre-trained word embeddings (GloVe, BERT, FastText) as features. We exploit a relatively large event corpus consisting of circa 610K short text event descriptions classified using 25 event categories that cover political violence and protest events. The best results, i.e., 83.5% macro and 92.4% micro F1 score, were obtained using the TF-IDF-weighted character n-gram model.
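
As a rough illustration of the TF-IDF-weighted character n-gram representation mentioned in this abstract, the following is a minimal sketch, not the authors' implementation. The n-gram range (2 to 4), the smoothed IDF formula, the L2 normalisation, and all function names are assumptions made for the example. The SVM classification step is omitted; cosine similarity is used only to show that related snippets end up close in this feature space.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=2, n_max=4):
    """Return all character n-grams of length n_min..n_max (lowercased text)."""
    text = text.lower()
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def tfidf_vectors(docs, n_min=2, n_max=4):
    """Map each document to a sparse, L2-normalised {ngram: tf * idf} dict."""
    counts = [Counter(char_ngrams(d, n_min, n_max)) for d in docs]
    # Document frequency: number of documents containing each n-gram.
    df = Counter(g for c in counts for g in c)
    n_docs = len(docs)
    vectors = []
    for c in counts:
        # Smoothed IDF (an assumption): log((1 + N) / (1 + df)) + 1.
        vec = {g: tf * (math.log((1 + n_docs) / (1 + df[g])) + 1)
               for g, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({g: w / norm for g, w in vec.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two L2-normalised sparse vectors."""
    return sum(w * v.get(g, 0.0) for g, w in u.items())


# Hypothetical event snippets, invented for this example.
snippets = [
    "Protesters marched in the capital",
    "Protest march held in capital city",
    "Airstrike hits village near border",
]
vecs = tfidf_vectors(snippets)
```

Because the features are sub-word character sequences, inflected or misspelled variants such as "Protesters" and "Protest" still share many n-grams, which is one reason such models need no language-specific preprocessing.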

New Benchmark Corpus and Models for Fine-grained Event Classification: To BERT or not to BERT?
Jakub Piskorski | Jacek Haneczok | Guillaume Jacquet
Proceedings of the 28th International Conference on Computational Linguistics

We introduce a new set of benchmark datasets derived from ACLED data for fine-grained event classification and compare the performance of various state-of-the-art models on these datasets, including SVMs based on TF-IDF character n-grams and neural context-free embeddings (GLOVE and FASTTEXT) as well as deep learning-based BERT with its contextual embeddings. The best results in terms of micro (94.3-94.9%) and macro F1 (86.0-88.9%) were obtained using the BERT transformer, with the simpler TF-IDF character n-gram based SVM being an interesting alternative. Further, we discuss the pros and cons of the considered benchmark models in terms of their robustness and the dependence of the classification performance on the size of training data.


JRC TMA-CC: Slavic Named Entity Recognition and Linking. Participation in the BSNLP-2019 shared task
Guillaume Jacquet | Jakub Piskorski | Hristo Tanev | Ralf Steinberger
Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing

We report on the participation of the JRC Text Mining and Analysis Competence Centre (TMA-CC) in the BSNLP-2019 Shared Task, which focuses on named-entity recognition, lemmatisation and cross-lingual linking. We propose a hybrid system combining a rule-based approach and light ML techniques. We use multilingual lexical resources such as JRC-NAMES and BABELNET together with a named entity guesser to recognise names. In a second step, we combine known names with wild cards to increase recognition recall by also capturing inflection variants. In a third step, we increase precision by filtering these name candidates with automatically learnt inflection patterns derived from name occurrences in large news article collections. Our major requirement is to achieve high precision. We achieved an average of 65% F-measure with 93% precision on the four languages.


Multi-word Entity Classification in a Highly Multilingual Environment
Sophie Chesney | Guillaume Jacquet | Ralf Steinberger | Jakub Piskorski
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)

This paper describes an approach for the classification of millions of existing multi-word entities (MWEntities), such as organisation or event names, into thirteen category types, based only on the tokens they contain. In order to classify our very large in-house collection of multilingual MWEntities into an application-oriented set of entity categories, we trained and tested distantly-supervised classifiers in 43 languages based on MWEntities extracted from BabelNet. The best-performing classifier was the multi-class SVM using a TF.IDF-weighted data representation. Interestingly, one unique classifier trained on a mix of all languages consistently performed better than classifiers trained for individual languages, reaching an averaged F1-value of 88.8%. In this paper, we present the training and test data, including a human evaluation of its accuracy, describe the methods used to train the classifiers, and discuss the results.


Cross-lingual Linking of Multi-word Entities and their corresponding Acronyms
Guillaume Jacquet | Maud Ehrmann | Ralf Steinberger | Jaakko Väyrynen
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper reports on an approach and experiments to automatically build a cross-lingual multi-word entity resource. Starting from a collection of millions of acronym/expansion pairs for 22 languages where expansion variants were grouped into monolingual clusters, we experiment with several aggregation strategies to link these clusters across languages. Aggregation strategies make use of string similarity distances and translation probabilities and they are based on vector space and graph representations. The accuracy of the approach is evaluated against Wikipedia’s redirection and cross-lingual linking tables. The resulting multi-word entity resource contains 64,000 multi-word entities with unique identifiers and their 600,000 multilingual lexical variants. We intend to make this new resource publicly available.


Named Entity Recognition on Turkish Tweets
Dilek Küçük | Guillaume Jacquet | Ralf Steinberger
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Various recent studies show that the performance of named entity recognition (NER) systems developed for well-formed text types drops significantly when applied to tweets. The only existing study for the highly inflected agglutinative language Turkish reports a drop in F-Measure from 91% to 19% when ported from news articles to tweets. In this study, we present a new named entity-annotated tweet corpus and a detailed analysis of the various tweet-specific linguistic phenomena. We perform comparative NER experiments with a rule-based multilingual NER system adapted to Turkish on three corpora: a news corpus, our new tweet corpus, and another tweet corpus. Based on the analysis and the experimentation results, we suggest system features required to improve NER results for social media like Twitter.

Clustering of Multi-Word Named Entity variants: Multilingual Evaluation
Guillaume Jacquet | Maud Ehrmann | Ralf Steinberger
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Multi-word entities, such as organisation names, are frequently written in many different ways. We have previously automatically identified over one million acronym pairs in 22 languages, consisting of their short form (e.g. EC) and their corresponding long forms (e.g. European Commission, European Union Commission). In order to automatically group such long form variants as belonging to the same entity, we cluster them, using bottom-up hierarchical clustering and pair-wise string similarity metrics. In this paper, we address the issue of how to evaluate the named entity variant clusters automatically, with minimal human annotation effort. We present experiments that make use of Wikipedia redirection tables and we show that this method produces good results.
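
The clustering step described in this abstract can be sketched as follows. This is only an illustrative sketch under stated assumptions: `difflib.SequenceMatcher` stands in for the pair-wise string similarity metrics (the abstract does not name them), single linkage and the 0.7 stopping threshold are arbitrary choices, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Pair-wise string similarity in [0, 1] (stand-in metric, an assumption)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_variants(variants, threshold=0.7):
    """Bottom-up (agglomerative) clustering with single linkage.

    Starts with one cluster per variant and repeatedly merges the two most
    similar clusters until no pair reaches the similarity threshold.
    """
    clusters = [[v] for v in variants]
    while len(clusters) > 1:
        best, best_pair = 0.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: similarity of the closest pair of members.
                link = max(similarity(a, b)
                           for a in clusters[i] for b in clusters[j])
                if link > best:
                    best, best_pair = link, (i, j)
        if best_pair is None or best < threshold:
            break
        i, j = best_pair
        clusters[i].extend(clusters.pop(j))
    return clusters


# Long forms taken from the abstract, plus one unrelated name for contrast.
long_forms = [
    "European Commission",
    "European Union Commission",
    "European Central Bank",
]
clusters = cluster_variants(long_forms)
```

With these inputs the two Commission variants are merged into one cluster while "European Central Bank" remains on its own; a lower threshold would merge more aggressively at the cost of precision.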

Resource Creation and Evaluation for Multilingual Sentiment Analysis in Social Media Texts
Alexandra Balahur | Marco Turchi | Ralf Steinberger | Jose-Manuel Perea-Ortega | Guillaume Jacquet | Dilek Küçük | Vanni Zavarella | Adil El Ghali
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents an evaluation of the use of machine translation to obtain and employ data for training multilingual sentiment classifiers. We show that machine-translated data yields results similar to those obtained with native-speaker translations of the same data. Additionally, our evaluations point to the fact that the use of multilingual data, including that obtained through machine translation, leads to improved results in sentiment classification. Finally, we show that the performance of the sentiment classifiers built on machine-translated data can be improved using original data from the target language and that even a small amount of such texts can lead to a significant gain in classification performance.


Clique-Based Clustering for Improving Named Entity Recognition Systems
Julien Ah-Pine | Guillaume Jacquet
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

Une expérience de fusion pour l’annotation d’entités nommées
Caroline Brun | Nicolas Dessaigne | Maud Ehrmann | Baptiste Gaillard | Sylvie Guillemin-Lanne | Guillaume Jacquet | Aaron Kaplan | Marianna Kucharski | Claude Martineau | Aurélie Migeotte | Takuya Nakamura | Stavroula Voyatzi
Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

We present an experiment in merging named entity annotations produced by different annotators. This work was carried out within the Infom@gic project, which aims to integrate and validate operational applications in knowledge engineering and information analysis, and which is supported by the Cap Digital competitiveness cluster "Image, MultiMédia et Vie Numérique". We first describe the four named entity annotators on which the experiment is based. Each of them produces entity annotations that comply with a standard developed within the Infom@gic project. We then present the annotation merging algorithm, which handles compatibility between annotations and highlights conflicts, and thereby provides more reliable information. We conclude by presenting and interpreting the merging results obtained on a manually annotated reference corpus.


Résolution de Métonymie des Entités Nommées : proposition d’une méthode hybride
Caroline Brun | Maud Ehrmann | Guillaume Jacquet
Actes de la 15ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

In this article, we describe the method we developed for resolving named entity metonymy for the SemEval 2007 competition. To resolve metonymy in location names and organisation names, as required by this task, we built a hybrid system that combines a robust syntactic parser with a distributional analysis method. We describe this method and the results the system obtained in the SemEval 2007 competition.


XRCE-M: A Hybrid System for Named Entity Metonymy Resolution
Caroline Brun | Maud Ehrmann | Guillaume Jacquet
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)


Construction automatique de classes de sélection distributionnelle
Guillaume Jacquet | Fabienne Venant
Actes de la 12ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

This study falls within the general framework of automatically disambiguating the sense of a verb in a given utterance. Our disambiguation method takes into account the construction of the verb, that is, the influence of the lexical and syntactic elements present in the utterance (the co-text). We now seek to complete this method by taking the semantic characteristics of the co-text into account. To do so, we associate with the corpus a continuous distributional space in which we build and visualise distributional classes. What distinguishes these classes is that they are computed on the fly. They therefore depend not only on the corpus but also on the context under study. We present here our method for computing the classes, together with the first results obtained.


Polysémie verbale et construction syntaxique : étude sur le verbe jouer
Guillaume Jacquet
Actes de la 10ème conférence sur le Traitement Automatique des Langues Naturelles. REncontres jeunes Chercheurs en Informatique pour le Traitement Automatique des Langues

In the semantic analysis of texts, one obstacle for NLP is the polysemy of linguistic units. For example, the meaning of the French verb jouer varies with context: Il joue de la trompette ("he plays the trumpet": to practise an instrument); Il joue avec son fils ("he plays with his son": to have fun). One approach to handling such sense ambiguities is the model of dynamic construction of meaning proposed by B. Victorri and C. Fuchs (1996). In this model, each polysemous unit is associated with a semantic space, and the meaning of the unit in a given utterance results from a dynamic interaction with the other units present in the utterance. We aim to show here that verbal constructions are elements of the co-text that contribute, just as the lexical co-text does, to the dynamic process of constructing the verb's meaning. The objective is thus to show that verbal constructions carry intrinsic meaning (Goldberg, 1995) and that, in our model, they make it possible to automatically constrain the meaning of a verb.