Erwan Moreau

2023

pdf abs
Improving Diachronic Word Sense Induction with a Nonparametric Bayesian method
Ashjan Alsulaimani | Erwan Moreau
Findings of the Association for Computational Linguistics: ACL 2023

Diachronic Word Sense Induction (DWSI) is the task of inducing the temporal representations of a word meaning from the context, as a set of senses and their prevalence over time. We introduce two new models for DWSI, based on topic modelling techniques: one is based on Hierarchical Dirichlet Processes (HDP), a nonparametric model; the other is based on the Dynamic Embedded Topic Model (DETM), a recent dynamic neural model. We evaluate these models against two state of the art DWSI models, using a time-stamped labelled dataset from the biomedical domain. We demonstrate that the two proposed models perform better than the state of the art. In particular, the HDP-based model drastically outperforms all the other models, including the dynamic neural model.

2020

pdf abs
An Evaluation Method for Diachronic Word Sense Induction
Ashjan Alsulaimani | Erwan Moreau | Carl Vogel
Findings of the Association for Computational Linguistics: EMNLP 2020

The task of Diachronic Word Sense Induction (DWSI) aims to identify the meaning of words from their context, taking the temporal dimension into account. In this paper we propose an evaluation method based on large-scale time-stamped annotated biomedical data, and a range of evaluation measures suited to the task. The approach is applied to two recent DWSI systems, thus demonstrating its relevance and providing an in-depth analysis of the models.

2018

pdf abs
CRF-Seq and CRF-DepTree at PARSEME Shared Task 2018: Detecting Verbal MWEs using Sequential and Dependency-Based Approaches
Erwan Moreau | Ashjan Alsulaimani | Alfredo Maldonado | Carl Vogel
Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018)

This paper describes two systems for detecting Verbal Multiword Expressions (VMWEs) which both competed in the closed track at the PARSEME VMWE Shared Task 2018. CRF-DepTree-categs implements an approach based on the dependency tree, intended to exploit the syntactic and semantic relations between tokens; CRF-Seq-nocategs implements a robust sequential method which requires only lemmas and morphosyntactic tags. Both systems ranked in the top half of the ranking, the latter ranking second for token-based evaluation. The code for both systems is published under the GNU General Public License version 3.0 and is available at http://github.com/erwanm/adapt-vmwe18.

pdf
Multilingual Word Segmentation: Training Many Language-Specific Tokenizers Smoothly Thanks to the Universal Dependencies Corpus
Erwan Moreau | Carl Vogel
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf abs
Detection of Verbal Multi-Word Expressions via Conditional Random Fields with Syntactic Dependency Features and Semantic Re-Ranking
Alfredo Maldonado | Lifeng Han | Erwan Moreau | Ashjan Alsulaimani | Koel Dutta Chowdhury | Carl Vogel | Qun Liu
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)

A description of a system for identifying Verbal Multi-Word Expressions (VMWEs) in running text is presented. The system mainly exploits universal syntactic dependency features through a Conditional Random Fields (CRF) sequence model. The system competed in the Closed Track at the PARSEME VMWE Shared Task 2017, ranking 2nd place in most languages on full VMWE-based evaluation and 1st in three languages on token-based evaluation. In addition, this paper presents an option to re-rank the 10 best CRF-predicted sequences via semantic vectors, boosting its scores above other systems in the competition. We also show that all systems in the competition would struggle to beat a simple lookup baseline system and argue for a more purpose-specific evaluation scheme.

L’objectif de cet article est d’évaluer dans quelle mesure les “fonctions syntaxiques” qui figurent dans une partie du corpus arboré de Paris 7 sont apprenables à partir d’exemples. La technique d’apprentissage automatique employée pour cela fait appel aux “Champs Aléatoires Conditionnels” (Conditional Random Fields ou CRF), dans une variante adaptée à l’annotation d’arbres. Les expériences menées sont décrites en détail et analysées. Moyennant un bon paramétrage, elles atteignent une F1-mesure de plus de 80%.

2008

pdf abs
Appariement d’entités nommées coréférentes : combinaisons de mesures de similarité par apprentissage supervisé
Erwan Moreau | François Yvon | Olivier Cappé
Actes de la 15ème conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

L’appariement d’entités nommées consiste à regrouper les différentes formes sous lesquelles apparaît une entité. Pour cela, des mesures de similarité textuelle sont généralement utilisées. Nous proposons de combiner plusieurs mesures afin d’améliorer les performances de la tâche d’appariement. À l’aide d’expériences menées sur deux corpus, nous montrons la pertinence de l’apprentissage supervisé dans ce but, particulièrement avec l’algorithme C4.5.

2007

pdf abs
Apprentissage symbolique de grammaires et traitement automatique des langues
Erwan Moreau
Actes de la 14ème conférence sur le Traitement Automatique des Langues Naturelles. Posters

Le modèle de Gold formalise le processus d’apprentissage d’un langage. Nous présentons dans cet article les avantages et inconvénients de ce cadre théorique contraignant, dans la perspective d’applications en TAL. Nous décrivons brièvement les récentes avancées dans ce domaine, qui soulèvent selon nous certaines questions importantes.

2004

pdf abs
Apprentissage partiel de grammaires catégorielles
Erwan Moreau
Actes de la 11ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Cet article traite de l’apprentissage symbolique de règles syntaxiques dans le modèle de Gold. Kanazawa a montré que certaines classes de grammaires catégorielles sont apprenables dans ce modèle. L’algorithme qu’il propose nécessite une grande quantité d’information en entrée pour être efficace. En changeant la nature des informations en entrée, nous proposons un algorithme d’apprentissage de grammaires catégorielles plus réaliste dans la perspective d’applications au langage naturel.