Matthieu Constant

Also published as: Mathieu Constant


Semeval-2022 Task 1: CODWOE – Comparing Dictionaries and Word Embeddings
Timothee Mickus | Kees Van Deemter | Mathieu Constant | Denis Paperno
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

Word embeddings have advanced the state of the art in NLP across numerous tasks. Understanding the contents of dense neural representations is of utmost interest to the computational semantics community. We propose to focus on relating these opaque word vectors to human-readable definitions, as found in dictionaries. This problem naturally divides into two subtasks: converting definitions into embeddings, and converting embeddings into definitions. This task was conducted in a multilingual setting, using comparable sets of embeddings trained homogeneously.

Word Sense Disambiguation of French Lexicographical Examples Using Lexical Networks
Aman Sinha | Sandrine Ollinger | Mathieu Constant
Proceedings of TextGraphs-16: Graph-based Methods for Natural Language Processing

This paper focuses on word sense disambiguation (WSD) of lexicographic examples using the French Lexical Network (fr-LN). For this purpose, we exploit the lexical and relational properties of the network, which we integrate into a feedforward neural WSD model on top of pretrained French BERT embeddings. We provide a comparative study with various models and further show the impact of our approach on polysemous units.
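The abstract describes a feedforward classifier over pretrained contextual embeddings. As an illustration only (toy weights and sense labels, not the authors' model), the core scoring step can be sketched as:

```python
# Minimal sketch: a one-hidden-layer feed-forward scorer that picks a
# sense for a target word from a precomputed contextual embedding.
# All weights and sense labels here are toy values, not fr-LN data.

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def disambiguate(context_vec, W_hidden, W_out, senses):
    """Score each candidate sense and return the best one."""
    h = relu(matvec(W_hidden, context_vec))   # hidden layer
    scores = matvec(W_out, h)                 # one score per sense
    best = max(range(len(senses)), key=lambda i: scores[i])
    return senses[best]

# Toy example: 3-dim "contextual" vector, two candidate senses.
W_hidden = [[1.0, 0.0, 0.0], [0.0, 1.0, -1.0]]
W_out = [[1.0, 0.0], [0.0, 1.0]]
senses = ["bank/finance", "bank/river"]
print(disambiguate([0.9, 0.1, 0.3], W_hidden, W_out, senses))
```

In the real model the input would be a BERT embedding and the weights would be learned, with lexical-network features added alongside.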

How to Dissect a Muppet: The Structure of Transformer Embedding Spaces
Timothee Mickus | Denis Paperno | Mathieu Constant
Transactions of the Association for Computational Linguistics, Volume 10

Pretrained embeddings based on the Transformer architecture have taken the NLP community by storm. We show that they can mathematically be reframed as a sum of vector factors and showcase how to use this reframing to study the impact of each component. We provide evidence that multi-head attentions and feed-forwards are not equally useful in all downstream applications, as well as a quantitative overview of the effects of finetuning on the overall embedding space. This approach allows us to draw connections to a wide range of previous studies, from vector space anisotropy to attention weights.
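The central observation, that a Transformer embedding can be reframed as a sum of vector factors, can be illustrated with toy numbers (this is a sketch of the additive decomposition idea, not the paper's exact derivation):

```python
# Illustrative sketch: a token's output embedding written as a sum of
# factors (residual/input term, one term per attention head, and a
# feed-forward term), so each factor's contribution can be ablated
# independently. All values are made up for illustration.

def vec_add(*vs):
    return [sum(xs) for xs in zip(*vs)]

input_term = [0.5, -0.2]                # residual stream contribution
head_terms = [[0.1, 0.3], [0.2, 0.0]]   # one term per attention head
ffn_term = [-0.1, 0.4]                  # feed-forward contribution

embedding = vec_add(input_term, *head_terms, ffn_term)

# Ablating a factor amounts to leaving it out of the sum.
no_ffn = vec_add(input_term, *head_terms)
print(embedding, no_ffn)
```

Because the decomposition is additive, studying a component's impact reduces to comparing the full sum with the sum minus that component.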

IAI @ SocialDisNER : Catch me if you can! Capturing complex disease mentions in tweets
Aman Sinha | Cristina Garcia Holgado | Marianne Clausel | Matthieu Constant
Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task

Biomedical NER is an active research area today. Despite the availability of state-of-the-art models for standard NER tasks, their performance degrades on biomedical data due to OOV entities and the challenges encountered in specialized domains. We use the Flair NER framework to investigate the effectiveness of various contextual and static embeddings for NER on Spanish tweets, in particular for capturing complex disease mentions.


Évaluation de méthodes et d’outils pour la lemmatisation automatique du français médiéval (Evaluation of methods and tools for automatic lemmatization in Old French)
Cristina Holgado | Alexei Lavrentiev | Mathieu Constant
Actes de la 28e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

For unstandardized historical languages such as Medieval French, automatic lemmatization still presents challenges, as this language exhibits strong spelling variation. In this article, we survey the state of automatic lemmatization for this language by comparing the performance of four existing lemmatizers on the same dataset. The goal is to assess where recent machine learning techniques stand relative to more traditional techniques based on rule systems and lexicons, in particular for the prediction of unknown words.


Rigor Mortis: Annotating MWEs with a Gamified Platform
Karën Fort | Bruno Guillaume | Yann-Alan Pilatte | Mathieu Constant | Nicolas Lefèbvre
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present here Rigor Mortis, a gamified crowdsourcing platform designed first to evaluate speakers’ intuition and then to train them to annotate multiword expressions (MWEs) in French corpora. We previously showed that speakers’ intuition is reasonably good (65% recall on non-fixed MWEs). We detail here the annotation results, after a training phase using some of the tests developed in the PARSEME-FR project.

What do you mean, BERT?
Timothee Mickus | Denis Paperno | Mathieu Constant | Kees van Deemter
Proceedings of the Society for Computation in Linguistics 2020

Génération automatique de définitions pour le français (Definition Modeling in French)
Timothee Mickus | Mathieu Constant | Denis Paperno
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 2 : Traitement Automatique des Langues Naturelles

Definition modeling is a recent task that aims to produce lexicographic definitions from word embeddings. We note two gaps: (i) the current state of the art has only addressed English and Chinese, and (ii) the intended use of this task as a method for evaluating word embeddings has yet to be verified. To address these, we propose a dataset for definition modeling in French, as well as an evaluation of the performance of a simple definition-generation model depending on the word embeddings it receives as input.


Démonstrateur en-ligne du projet ANR PARSEME-FR sur les expressions polylexicales (On-line demonstrator of the PARSEME-FR project on multiword expressions)
Marine Schmitt | Elise Moreau | Mathieu Constant | Agata Savary
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume IV : Démonstrations

We present the on-line demonstrator of the ANR PARSEME-FR project, dedicated to multiword expressions. It includes several tools for identifying such expressions, as well as a tool for exploring the linguistic resources of the project.

Neural Lemmatization of Multiword Expressions
Marine Schmitt | Mathieu Constant
Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019)

This article focuses on the lemmatization of multiword expressions (MWEs). We propose a deep encoder-decoder architecture generating for every MWE word its corresponding part in the lemma, based on the internal context of the MWE. The encoder relies on recurrent networks based on (1) the character sequence of the individual words to capture their morphological properties, and (2) the word sequence of the MWE to capture lexical and syntactic properties. The decoder, in charge of generating the corresponding part of the lemma for each word of the MWE, is based on a classical character-level attention-based recurrent model. Our model is evaluated for Italian, French, Polish and Portuguese and shows good performance except for Polish.

Comparing linear and neural models for competitive MWE identification
Hazem Al Saied | Marie Candito | Mathieu Constant
Proceedings of the 22nd Nordic Conference on Computational Linguistics

In this paper, we compare the use of linear versus neural classifiers in a greedy transition system for MWE identification. Both our linear and neural models achieve a new state of the art on the PARSEME 1.1 shared task data sets, comprising 20 languages. Surprisingly, our best model is a simple feed-forward network with one hidden layer, although more sophisticated (recurrent) architectures were tested. The feedback from this study is that tuning an SVM is rather straightforward, whereas tuning our neural system proved more challenging. Given the number of languages and the variety of linguistic phenomena to handle for the MWE identification task, we designed a careful tuning procedure, and we show that hyperparameters are better selected by a majority vote over random-search configurations than by simply keeping the single best configuration. Although the performance is rather good (better than both the best shared-task system and the average of the best per-language results), further work is needed to improve the generalization power, especially on unseen MWEs.
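The majority-vote selection idea can be sketched as follows (names and numbers are illustrative, not the paper's actual search space): take the top-k random-search configurations by development score and, for each hyperparameter, keep the most frequent value across them.

```python
# Sketch of majority-vote hyperparameter selection over random search.
# Configurations and scores below are invented for illustration.
from collections import Counter

def majority_vote_config(configs, scores, k=3):
    """configs: list of dicts; scores: parallel list of dev scores.
    Returns a config built from the per-parameter majority value
    among the k best-scoring configurations."""
    top = sorted(zip(configs, scores), key=lambda cs: cs[1], reverse=True)[:k]
    voted = {}
    for param in top[0][0]:
        values = [cfg[param] for cfg, _ in top]
        voted[param] = Counter(values).most_common(1)[0][0]
    return voted

configs = [
    {"hidden": 128, "lr": 0.01},
    {"hidden": 128, "lr": 0.1},
    {"hidden": 256, "lr": 0.01},
    {"hidden": 64,  "lr": 0.5},
]
scores = [0.71, 0.70, 0.69, 0.40]
print(majority_vote_config(configs, scores))
```

The voted configuration can differ from every sampled one; the point is that it is more robust than trusting a single best run.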

Mark my Word: A Sequence-to-Sequence Approach to Definition Modeling
Timothee Mickus | Denis Paperno | Matthieu Constant
Proceedings of the First NLPL Workshop on Deep Learning for Natural Language Processing

Defining words in a textual context is a useful task both for practical purposes and for gaining insight into distributed word representations. Building on the distributional hypothesis, we argue here that the most natural formalization of definition modeling is to treat it as a sequence-to-sequence task, rather than a word-to-sequence task: given an input sequence with a highlighted word, generate a contextually appropriate definition for it. We implement this approach in a Transformer-based sequence-to-sequence model. Our proposal allows us to train contextualization and definition generation in an end-to-end fashion, which is a conceptual improvement over earlier works. We achieve state-of-the-art results both in contextual and non-contextual definition modeling.
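The "input sequence with a highlighted word" formulation can be sketched as a source-side encoding step (the marker tokens below are illustrative, not necessarily the paper's):

```python
# Sketch: encode the source for a seq2seq definition model as the full
# context with the word to define wrapped in marker tokens.
# The markers <define> ... </define> are hypothetical names.

def encode_source(context_tokens, target_index):
    return (context_tokens[:target_index]
            + ["<define>", context_tokens[target_index], "</define>"]
            + context_tokens[target_index + 1:])

src = encode_source("the bank of the river".split(), 1)
print(" ".join(src))
```

A standard encoder-decoder can then be trained on such sources paired with dictionary definitions, so contextualization and generation are learned jointly.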


“Fingers in the Nose”: Evaluating Speakers’ Identification of Multi-Word Expressions Using a Slightly Gamified Crowdsourcing Platform
Karën Fort | Bruno Guillaume | Matthieu Constant | Nicolas Lefèbvre | Yann-Alan Pilatte
Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018)

This article presents the results we obtained in crowdsourcing French speakers’ intuition concerning multi-word expressions (MWEs). We developed a slightly gamified crowdsourcing platform, part of which is designed to test users’ ability to identify MWEs with no prior training. The participants perform relatively well at the task, with a recall reaching 65% for MWEs that do not behave as function words.


Annotation d’expressions polylexicales verbales en français (Annotation of verbal multiword expressions in French)
Marie Candito | Mathieu Constant | Carlos Ramisch | Agata Savary | Yannick Parmentier | Caroline Pasquer | Jean-Yves Antoine
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 2 - Articles courts

We describe the French part of the data produced for the multilingual PARSEME campaign on the identification of verbal multiword expressions (Savary et al., 2017). The expressions covered for French are idiomatic verbal expressions, inherently pronominal verbs, and a generalization of light-verb constructions. These phenomena were annotated in the French-UD corpus (Nivre et al., 2016) and the Sequoia corpus (Candito & Seddah, 2012), i.e. a corpus of 22,645 sentences, for a total of 4,962 annotated expressions. This yields a ratio of roughly one annotated expression per 100 tokens, with a high proportion of discontinuous expressions (40%).

The ATILF-LLF System for Parseme Shared Task: a Transition-based Verbal Multiword Expression Tagger
Hazem Al Saied | Matthieu Constant | Marie Candito
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)

We describe the ATILF-LLF system built for the MWE 2017 Shared Task on automatic identification of verbal multiword expressions. We participated in the closed track only, for all 18 available languages. Our system is a robust greedy transition-based system, in which MWEs are identified through a MERGE transition. The system was meant to accommodate the variety of linguistic resources provided for each language, in terms of accompanying morphological and syntactic information. Using the per-MWE F-score, the system was ranked first for all but two languages (Hungarian and Romanian).
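The MERGE transition can be illustrated with a toy oracle-driven version (the real system learns its transitions with a classifier; the lexicon lookup below stands in for that decision):

```python
# Toy sketch of a greedy transition system building MWEs via MERGE.
# SHIFT pushes a token on the stack; MERGE joins the two stack tops
# whenever their concatenation is (a prefix of) a known MWE.
# The lexicon-based oracle here is illustrative, not the trained model.

def identify_mwes(tokens, lexicon):
    stack, found = [], []
    for tok in tokens:
        stack.append(tok)                                  # SHIFT
        while len(stack) >= 2:
            cand = stack[-2] + " " + stack[-1]
            if not any(m == cand or m.startswith(cand + " ") for m in lexicon):
                break
            stack.pop(); stack.pop(); stack.append(cand)   # MERGE
            if cand in lexicon:
                found.append(cand)                         # mark as MWE
    return found

print(identify_mwes("he takes part in it".split(), {"takes part"}))
```

A trained classifier replaces the lexicon test, scoring SHIFT versus MERGE from features of the stack, buffer, and any morphological or syntactic annotation available.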

Benchmarking Joint Lexical and Syntactic Analysis on Multiword-Rich Data
Matthieu Constant | Héctor Martinez Alonso
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)

This article evaluates the extension of a dependency parser that performs joint syntactic analysis and multiword expression identification. We show that, given sufficient training data, the parser benefits from explicit multiword information and improves overall labeled accuracy score in eight of the ten evaluation cases.

Multiword Expression Processing: A Survey
Mathieu Constant | Gülşen Eryiǧit | Johanna Monti | Lonneke van der Plas | Carlos Ramisch | Michael Rosner | Amalia Todirascu
Computational Linguistics, Volume 43, Issue 4 - December 2017

Multiword expressions (MWEs) are a class of linguistic forms spanning conventional word boundaries that are both idiosyncratic and pervasive across different languages. The structure of linguistic processing that depends on the clear distinction between words and phrases has to be re-thought to accommodate MWEs. The issue of MWE handling is crucial for NLP applications, where it raises a number of challenges. The emergence of solutions in the absence of guiding principles motivates this survey, whose aim is not only to provide a focused review of MWE processing, but also to clarify the nature of interactions between MWE processing and downstream applications. We propose a conceptual framework within which challenges and research contributions can be positioned. It offers a shared understanding of what is meant by “MWE processing,” distinguishing the subtasks of MWE discovery and identification. It also elucidates the interactions between MWE processing and two use cases: Parsing and machine translation. Many of the approaches in the literature can be differentiated according to how MWE processing is timed with respect to underlying use cases. We discuss how such orchestration choices affect the scope of MWE-aware systems. For each of the two MWE processing subtasks and for each of the two use cases, we conclude on open issues and research perspectives.


Deep Lexical Segmentation and Syntactic Parsing in the Easy-First Dependency Framework
Matthieu Constant | Joseph Le Roux | Nadi Tomeh
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Improvement of VerbNet-like resources by frame typing
Laurence Danlos | Matthieu Constant | Lucie Barque
Proceedings of the Workshop on Grammar and Lexicon: interactions and interfaces (GramLex)

Verbenet is a French lexicon developed by “translation” of its English counterpart, VerbNet (Kipper-Schuler, 2005), and by treatment of the specificities of French syntax (Pradet et al., 2014; Danlos et al., 2016). One difficulty encountered in its development stems from the fact that the list of (potentially numerous) frames has no internal organization. This paper proposes a type system for frames that shows whether two frames are variants of a given alternation. Frame typing facilitates coherence checking of the resource in a “virtuous circle”. We present the principles underlying a program we developed and used to automatically type frames in Verbenet. We also show that our system is portable to other languages.

A Transition-Based System for Joint Lexical and Syntactic Analysis
Matthieu Constant | Joakim Nivre
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)


Analyse syntaxique de l’ancien français : quelles propriétés de la langue influent le plus sur la qualité de l’apprentissage ?
Gaël Guibon | Isabelle Tellier | Sophie Prévost | Matthieu Constant | Kim Gerdes
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

This article presents the results of machine learning experiments on part-of-speech tagging and dependency parsing of Old French. These experiments are intended to support corpus exploration, with the SRCMF treebank serving as reference data. The weakly standardized nature of the language involved implies heterogeneous and quantitatively limited training data. We therefore explore various strategies, based on different criteria (lexical variability, verse/prose form of the texts, dates of the texts), for building training corpora that lead to the best possible results.


Strategies for Contiguous Multiword Expression Analysis and Dependency Parsing
Marie Candito | Matthieu Constant
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Syntactic Parsing and Compound Recognition via Dual Decomposition: Application to French
Joseph Le Roux | Antoine Rozenknop | Matthieu Constant
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers


The LIGM-Alpage architecture for the SPMRL 2013 Shared Task: Multiword Expression Analysis and Dependency Parsing
Matthieu Constant | Marie Candito | Djamé Seddah
Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically-Rich Languages


Discriminative Strategies to Integrate Multiword Expression Recognition and Parsing
Matthieu Constant | Anthony Sigogne | Patrick Watrin
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

La reconnaissance des mots composés à l’épreuve de l’analyse syntaxique et vice-versa : évaluation de deux stratégies discriminantes (Recognition of Compound Words Tested against Parsing and Vice-versa : Evaluation of Two Discriminative Approaches) [in French]
Matthieu Constant | Anthony Sigogne | Patrick Watrin
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 2: TALN

A new semantically annotated corpus with syntactic-semantic and cross-lingual senses
Myriam Rakho | Éric Laporte | Matthieu Constant
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this article, we describe a new sense-tagged corpus for Word Sense Disambiguation. The corpus consists of instances of 20 French polysemous verbs. Each verb instance is annotated with three sense labels: (1) the actual translation of the verb in the English version of this instance in a parallel corpus, (2) an entry of the verb in a computational dictionary of French (the Lexicon-Grammar tables), and (3) a fine-grained sense label resulting from the concatenation of the translation and the Lexicon-Grammar entry.

Evaluating the Impact of External Lexical Resources into a CRF-based Multiword Segmenter and Part-of-Speech Tagger
Matthieu Constant | Isabelle Tellier
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper evaluates the impact of external lexical resources on a CRF-based joint multiword segmenter and part-of-speech tagger. In particular, we show different ways of integrating lexicon-based features into the tagging model. We report an absolute gain of 0.5% in F-measure. Moreover, we show that integrating lexicon-based features largely compensates for the use of a small training corpus.
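Lexicon-based features of the kind described can be sketched as a feature-extraction function for a CRF (feature names and the toy lexicon are made up; the actual feature templates are the paper's):

```python
# Illustrative sketch: per-token feature dict for a CRF tagger,
# augmented with external-lexicon features marking whether the token
# falls inside a known multiword unit (B = begins it, I = inside it).

def token_features(tokens, i, lexicon):
    feats = {"word": tokens[i].lower(), "suffix3": tokens[i][-3:]}
    for mwu in lexicon:
        parts = mwu.split()
        # Try every window of the right length that covers position i.
        for start in range(max(0, i - len(parts) + 1), i + 1):
            if tokens[start:start + len(parts)] == parts:
                feats["in_mwu"] = mwu
                feats["mwu_pos"] = "B" if start == i else "I"
    return feats

toks = ["pomme", "de", "terre", "cuite"]
print(token_features(toks, 0, ["pomme de terre"]))
```

In a CRF these dictionary-membership features complement the sparse word-form features, which is why they help most when the training corpus is small.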

Extending the adverbial coverage of a French morphological lexicon
Elsa Tolone | Stavroula Voyatzi | Claude Martineau | Matthieu Constant
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present an extension of the adverbial entries of the French morphological lexicon DELA (Dictionnaires Electroniques du LADL / LADL electronic dictionaries). Adverbs were extracted from LGLex, an NLP-oriented syntactic resource for French, which in turn contains all adverbs extracted from the Lexicon-Grammar tables of both simple adverbs ending in -ment (i.e., '-ly') and compound adverbs. This work exploits fine-grained linguistic information provided in existing resources. The resulting resource was reviewed to remove duplicates and is freely available under the LGPL-LR license.


Intégrer des connaissances linguistiques dans un CRF : application à l’apprentissage d’un segmenteur-étiqueteur du français (Integrating linguistic knowledge in a CRF: application to learning a segmenter-tagger of French)
Matthieu Constant | Isabelle Tellier | Denys Duchier | Yoann Dupont | Anthony Sigogne | Sylvie Billot
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

In this article, we summarize the results of several series of experiments carried out with linear CRFs (Conditional Random Fields) to learn to annotate French texts from examples, exploiting various external linguistic resources. These experiments concerned part-of-speech tagging integrating the identification of multiword units. We show that the CRF model can integrate lexical resources rich in multiword units in different ways, and thereby achieves the best tagging accuracy reported to date for French.

Integration of Data from a Syntactic Lexicon into Generative and Discriminative Probabilistic Parsers
Anthony Sigogne | Matthieu Constant | Éric Laporte
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

MWU-Aware Part-of-Speech Tagging with a CRF Model and Lexical Resources
Matthieu Constant | Anthony Sigogne
Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World

French parsing enhanced with a word clustering method based on a syntactic lexicon
Anthony Sigogne | Matthieu Constant | Éric Laporte
Proceedings of the Second Workshop on Statistical Parsing of Morphologically Rich Languages

Proceedings of the 9th International Workshop on Finite State Methods and Natural Language Processing
Andreas Maletti | Matthieu Constant
Proceedings of the 9th International Workshop on Finite State Methods and Natural Language Processing


Partial Parsing of Spontaneous Spoken French
Olivier Blanc | Matthieu Constant | Anne Dister | Patrick Watrin
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper describes the process and the resources used to automatically annotate a French corpus of spontaneous speech transcriptions in super-chunks. Super-chunks are enhanced chunks that can contain lexical multiword units. This partial parsing is based on a preprocessing stage of the spoken data that consists in reformatting and tagging utterances that break the syntactic structure of the text, such as disfluencies. Spoken specificities were formalized through a systematic linguistic study of a 40-hour-long speech transcription corpus. The chunker uses large-coverage and fine-grained language resources for general written language that have been augmented with resources specific to spoken French. It iteratively applies finite-state lexical and syntactic resources and outputs a finite automaton representing all possible chunk analyses. The best path is then selected by a hybrid disambiguation stage. We show that our system reaches scores that are comparable with state-of-the-art results in the field.

Evaluating the Impact of Some Linguistic Information on the Performances of a Similarity-based and Translation-oriented Word-Sense Disambiguation Method
Myriam Rakho | Matthieu Constant
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this article, we present an experiment in tuning linguistic parameters for representing the semantic space of polysemous words. We quantitatively evaluate the influence of some basic linguistic knowledge (lemmas, multiword expressions, grammatical tags and syntactic relations) on the performance of a similarity-based word sense disambiguation method. The question we try to answer with this experiment is which kinds of linguistic knowledge are most useful for the semantic disambiguation of polysemous words in a multilingual framework. The experiment covers 20 French polysemous words (16 nouns and 4 verbs), and we use the French-English part of the sentence-aligned Europarl corpus for training and testing. Our results show a strong correlation between system accuracy and the precision of the linguistic features used, particularly the syntactic dependency relations. Furthermore, the lemma-based approach clearly outperforms the word-form-based approach. The best accuracy achieved by our system is 90%.


Segmentation en super-chunks
Olivier Blanc | Matthieu Constant | Patrick Watrin
Actes de la 14ème conférence sur le Traitement Automatique des Langues Naturelles. Posters

Since the parser developed by Harris at the end of the 1950s, multiword units have gradually been integrated into syntactic parsers. For the most part, however, they are still restricted to compound words, which are more stable and less numerous. Yet language is full of semi-fixed expressions that also form semantic units: adverbial expressions and collocations. As with traditional compound words, identifying these structures limits the combinatorial complexity induced by lexical ambiguity. In this article, we describe an experiment that integrates these notions into a super-chunk segmentation process carried out prior to syntactic parsing. We show that our chunker, developed for French, reaches a precision of 92.9% and a recall of 98.7%. Moreover, multiword units account for 36.6% of the attachments internal to nominal and prepositional constituents.


Outilex, plate-forme logicielle de traitement de textes écrits
Olivier Blanc | Matthieu Constant | Éric Laporte
Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

The Outilex software platform, which will be made available to research, development, and industry, comprises software components that perform all the fundamental operations of written text processing: lexicon-free processing, exploitation of lexicons and grammars, and management of linguistic resources. The data handled are structured in XML formats, and also in other more compact formats, either readable or binary, when necessary; the required format converters are included in the platform; the grammar formats make it possible to combine statistical methods with methods based on linguistic resources. Finally, manually built French and English lexicons of substantial coverage, developed at the LADL, will be distributed with the platform under the LGPL-LR license.

Outilex, a Linguistic Platform for Text Processing
Olivier Blanc | Matthieu Constant
Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions


Methods for Constructing Lexicon-Grammar Resources: The Example of Measure Expressions
Matthieu Constant
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)


Bibliothèques d’automates finis et grammaires context-free : de nouveaux traitements informatiques
Matthieu Constant
Actes de la 8ème conférence sur le Traitement Automatique des Langues Naturelles. REncontres jeunes Chercheurs en Informatique pour le Traitement Automatique des Langues

The amount of documents available via the Internet is exploding. This situation prompts us to look for new tools for locating information in documents and, in particular, to study context-free grammar algorithms applied to families of finite-state automaton graphs (strictly finite or cyclic). We propose a new representation and new computational treatments of these grammars, in order to ensure fast access to the data and memory-efficient storage.