Thierry Poibeau

2023

On the Correspondence between Compositionality and Imitation in Emergent Neural Communication
Emily Cheng | Mathieu Rita | Thierry Poibeau
Findings of the Association for Computational Linguistics: ACL 2023

Compositionality is a hallmark of human language that not only enables linguistic generalization, but also potentially facilitates acquisition. When simulating language emergence with neural networks, compositionality has been shown to improve communication performance; however, its impact on imitation learning has yet to be investigated. Our work explores the link between compositionality and imitation in a Lewis game played by deep neural agents. Our contributions are twofold: first, we show that the learning algorithm used to imitate is crucial: supervised learning tends to produce more average languages, while reinforcement learning introduces a selection pressure toward more compositional languages. Second, our study reveals that compositional languages are easier to imitate, which may induce the pressure toward compositional languages in RL imitation settings.

Quelques observations sur la notion de biais dans les modèles de langue [Some observations on the notion of bias in language models]
Romane Gallienne | Thierry Poibeau
Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 3 : prises de position en TAL

This paper revisits the notion of bias in language models. Using examples drawn from generative models for French (GPT-type models), we show that carefully chosen prompts can easily steer generated text toward potentially problematic output (stereotypes, biases, etc.). The actions that follow from this observation are not neutral, however: debiasing models has a positive side, but it also raises many questions (how do we decide what should be corrected? who can or should decide? against which norm?). Finally, we show that the questions raised are not only technological but above all social, and tied to the context in which the target applications are used.

2022

Probing for the Usage of Grammatical Number
Karim Lasri | Tiago Pimentel | Alessandro Lenci | Thierry Poibeau | Ryan Cotterell
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

A central quest of probing is to uncover how pre-trained models encode a linguistic property within their representations. An encoding, however, might be spurious—i.e., the model might not rely on it when making predictions. In this paper, we try to find an encoding that the model actually uses, introducing a usage-based probing setup. We first choose a behavioral task which cannot be solved without using the linguistic property. Then, we attempt to remove the property by intervening on the model’s representations. We contend that, if an encoding is used by the model, its removal should harm the performance on the chosen behavioral task. As a case study, we focus on how BERT encodes grammatical number, and on how it uses this encoding to solve the number agreement task. Experimentally, we find that BERT relies on a linear encoding of grammatical number to produce the correct behavioral output. We also find that BERT uses a separate encoding of grammatical number for nouns and verbs. Finally, we identify in which layers information about grammatical number is transferred from a noun to its head verb.

On “Human Parity” and “Super Human Performance” in Machine Translation Evaluation
Thierry Poibeau
Proceedings of the Thirteenth Language Resources and Evaluation Conference

In this paper, we reassess claims of human parity and super human performance in machine translation. Although these terms have already been discussed, as well as the evaluation protocols used to reach these conclusions (human parity is achieved i) only for a very small number of languages, ii) on very specific types of documents and iii) with very literal translations), we show that the terms used are themselves problematic, and that human translation involves much more than what is embedded in automatic systems. We also discuss ethical issues related to the way results are presented and advertised. Finally, we claim that a better assessment of human capacities should be put forward and that the goal of replacing humans by machines is not a desirable one.

Automatic Generation of Factual News Headlines in Finnish
Maximilian Koppatz | Khalid Alnajjar | Mika Hämäläinen | Thierry Poibeau
Proceedings of the 15th International Conference on Natural Language Generation

Does BERT really agree ? Fine-grained Analysis of Lexical Dependence on a Syntactic Task
Karim Lasri | Alessandro Lenci | Thierry Poibeau
Findings of the Association for Computational Linguistics: ACL 2022

Although transformer-based Neural Language Models demonstrate impressive performance on a variety of tasks, their generalization abilities are not well understood. They have been shown to perform strongly on subject-verb number agreement in a wide array of settings, suggesting that they learned to track syntactic dependencies during their training even without explicit supervision. In this paper, we examine the extent to which BERT is able to perform lexically-independent subject-verb number agreement (NA) on targeted syntactic templates. To do so, we disrupt the lexical patterns found in naturally occurring stimuli for each targeted structure in a novel fine-grained analysis of BERT’s behavior. Our results on nonce sentences suggest that the model generalizes well for simple templates, but fails to perform lexically-independent syntactic generalization when as little as one attractor is present.

Word Order Matters When You Increase Masking
Karim Lasri | Alessandro Lenci | Thierry Poibeau
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Word order, an essential property of natural languages, is injected in Transformer-based neural language models using position encoding. However, recent experiments have shown that explicit position encoding is not always useful, since some models without such a feature managed to achieve state-of-the-art performance on some tasks. To better understand this phenomenon, we examine the effect of removing position encodings on the pre-training objective itself (i.e., masked language modelling), to test whether models can reconstruct position information from co-occurrences alone. We do so by controlling the amount of masked tokens in the input sentence, as a proxy to affect the importance of position information for the task. We find that the necessity of position information increases with the amount of masking, and that masked language models without position encodings are not able to reconstruct this information on the task. These findings point towards a direct relationship between the amount of masking and the ability of Transformers to capture order-sensitive aspects of language using position encoding.

Subject Verb Agreement Error Patterns in Meaningless Sentences: Humans vs. BERT
Karim Lasri | Olga Seminck | Alessandro Lenci | Thierry Poibeau
Proceedings of the 29th International Conference on Computational Linguistics

Both humans and neural language models are able to perform subject verb number agreement (SVA). In principle, semantics shouldn’t interfere with this task, which only requires syntactic knowledge. In this work we test whether meaning interferes with this type of agreement in English in syntactic structures of various complexities. To do so, we generate both semantically well-formed and nonsensical items. We compare the performance of BERT-base to that of humans, obtained with a psycholinguistic online crowdsourcing experiment. We find that BERT and humans are both sensitive to our semantic manipulation: they fail more often when presented with nonsensical items, especially when their syntactic structure features an attractor (a noun phrase between the subject and the verb that does not have the same number as the subject). We also find that the effect of meaningfulness on SVA errors is stronger for BERT than for humans, showing higher lexical sensitivity of the former on this task.

2021

Text Zoning of Theater Reviews: How Different are Journalistic from Blogger Reviews?
Mylene Maignant | Thierry Poibeau | Gaëtan Brison
Proceedings of the Workshop on Natural Language Processing for Digital Humanities

This paper aims at modeling the structure of theater reviews based on contemporary London performances by using text zoning. Text zoning consists in tagging sentences so as to reveal text structure. More than 40,000 theater reviews dating from 2010 to 2020 were collected to analyze two different types of reception (journalistic vs digital). We present our annotation scheme and the classifiers used to perform the text zoning task, aiming at tagging reviews at the sentence level. We obtain the best results using the random forest algorithm, and show that this approach makes it possible to give a first insight into the similarities and differences between our two subcorpora.

2020

We introduce Multi-SimLex, a large-scale lexical resource and evaluation benchmark covering data sets for 12 typologically diverse languages, including major languages (e.g., Mandarin Chinese, Spanish, Russian) as well as less-resourced ones (e.g., Welsh, Kiswahili). Each language data set is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs, providing a representative coverage of word classes (nouns, verbs, adjectives, adverbs), frequency ranks, similarity intervals, lexical fields, and concreteness levels. Additionally, owing to the alignment of concepts across languages, we provide a suite of 66 crosslingual semantic similarity data sets. Because of its extensive size and language coverage, Multi-SimLex provides entirely novel opportunities for experimental evaluation and analysis. On its monolingual and crosslingual benchmarks, we evaluate and analyze a wide array of recent state-of-the-art monolingual and crosslingual representation models, including static and contextualized word embeddings (such as fastText, monolingual and multilingual BERT, XLM), externally informed lexical representations, as well as fully unsupervised and (weakly) supervised crosslingual word embeddings. We also present a step-by-step data set creation protocol for creating consistent, Multi-SimLex-style resources for additional languages. We make these contributions (the public release of Multi-SimLex data sets, their creation protocol, strong baseline results, and in-depth analyses which can be helpful in guiding future developments in multilingual lexical semantics and representation learning) available via a Web site that will encourage community effort in further expansion of Multi-SimLex to many more languages. Such a large-scale semantic resource could inspire significant further advances in NLP across languages.

Sonnet Combinatorics with OuPoCo
Thierry Poibeau | Mylène Maignant | Frédérique Mélanie-Becquet | Clément Plancq | Matthieu Raffard | Mathilde Roussel
Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

In this paper, we describe OuPoCo, a system producing new sonnets by recombining verses from existing sonnets, following an idea that Queneau described in his book “Cent mille milliards de poèmes” (Gallimard, 1961). We propose to demonstrate different outputs of our implementation (a Web site, a Twitter bot and a specifically developed device, called ‘La Boîte à poésie’) based on a corpus of 19th century French poetry. Our goal is to make people interested in poetry again, by giving access to automatically produced sonnets through original and entertaining channels and devices.

2019

Linguistic typology aims to capture structural and semantic variation across the world’s languages. A large-scale typology could provide excellent guidance for multilingual Natural Language Processing (NLP), particularly for languages that suffer from the lack of human labeled resources. We present an extensive literature survey on the use of typological information in the development of NLP techniques. Our survey demonstrates that to date, the use of information in existing typological databases has resulted in consistent but modest improvements in system performance. We show that this is due to both intrinsic limitations of databases (in terms of coverage and feature granularity) and under-utilization of the typological features included in them. We advocate for a new approach that adapts the broad and discrete nature of typological categories to the contextual and continuous nature of machine learning algorithms used in contemporary NLP. In particular, we suggest that such an approach could be facilitated by recent developments in data-driven induction of typological knowledge.

2018

SEx BiST: A Multi-Source Trainable Parser with Deep Contextualized Lexical Representations
KyungTae Lim | Cheoneum Park | Changki Lee | Thierry Poibeau
Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

We describe the SEx BiST parser (Semantically EXtended Bi-LSTM parser) developed at Lattice for the CoNLL 2018 Shared Task (Multilingual Parsing from Raw Text to Universal Dependencies). The main characteristic of our work is the encoding of three different modes of contextual information for parsing: (i) Treebank feature representations, (ii) Multilingual word representations, (iii) ELMo representations obtained via unsupervised learning from external resources. Our parser performed well in the official end-to-end evaluation (73.02 LAS – 4th/26 teams, and 78.72 UAS – 2nd/26); remarkably, we achieved the best UAS scores on all the English corpora by applying the three suggested feature representations. Finally, we were also ranked 1st at the optional event extraction task, part of the 2018 Extrinsic Parser Evaluation campaign.

Dependency Parsing of Code-Switching Data with Cross-Lingual Feature Representations
Niko Partanen | Kyungtae Lim | Michael Rießler | Thierry Poibeau
Proceedings of the Fourth International Workshop on Computational Linguistics of Uralic Languages

Proceedings of the Eight Workshop on Cognitive Aspects of Computational Language Learning and Processing
Marco Idiart | Alessandro Lenci | Thierry Poibeau | Aline Villavicencio
Proceedings of the Eight Workshop on Cognitive Aspects of Computational Language Learning and Processing

The First Komi-Zyrian Universal Dependencies Treebanks
Niko Partanen | Rogier Blokland | KyungTae Lim | Thierry Poibeau | Michael Rießler
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)

Two Komi-Zyrian treebanks were included in the Universal Dependencies 2.2 release. This article contextualizes the treebanks, discusses the process through which they were created, and outlines the future plans and timeline for the next improvements. Special attention is paid to the possibilities of using UD in the documentation and description of endangered languages.

Analyse syntaxique de langues faiblement dotées à partir de plongements de mots multilingues [Syntactic analysis of under-resourced languages from multilingual word embeddings]
KyungTae Lim | Niko Partanen | Thierry Poibeau
Traitement Automatique des Langues, Volume 59, Numéro 3 : Traitement automatique des langues peu dotées [NLP for Under-Resourced Languages]

Multilingual Dependency Parsing for Low-Resource Languages: Case Studies on North Saami and Komi-Zyrian
KyungTae Lim | Niko Partanen | Thierry Poibeau
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Preliminary Experiments concerning Verbal Predicative Structure Extraction from a Large Finnish Corpus
Guersande Chaminade | Thierry Poibeau
Proceedings of the Third Workshop on Computational Linguistics for Uralic Languages

Enjambment Detection in a Large Diachronic Corpus of Spanish Sonnets
Pablo Ruiz Fabo | Clara Martínez Cantón | Thierry Poibeau | Elena González-Blanco
Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

Enjambment takes place when a syntactic unit is broken up across two lines of poetry, giving rise to different stylistic effects. In Spanish literary studies, there are unclear points about the types of stylistic effects that can arise, and under which linguistic conditions. To systematically gather evidence about this, we developed a system to automatically identify enjambment (and its type) in Spanish. For evaluation, we manually annotated a reference corpus covering different periods. As a scholarly corpus on which to apply the tool, we created from public HTML sources a diachronic corpus covering four centuries of sonnets (3750 poems), and we analyzed the occurrence of enjambment across stanzaic boundaries in different periods. In addition, we found examples that highlight limitations in current definitions of enjambment.

UDLex: Towards Cross-language Subcategorization Lexicons
Giulia Rambelli | Alessandro Lenci | Thierry Poibeau
Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017)

A System for Multilingual Dependency Parsing based on Bidirectional LSTM Feature Representations
KyungTae Lim | Thierry Poibeau
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

In this paper, we present our multilingual dependency parser developed for the CoNLL 2017 UD Shared Task dealing with “Multilingual Parsing from Raw Text to Universal Dependencies”. Our parser extends the monolingual BIST-parser as a multi-source multilingual trainable parser. Thanks to multilingual word embeddings and one-hot encodings for languages, our system can use both monolingual and multi-source training. We trained 69 monolingual language models and 13 multilingual models for the shared task. Our multilingual approach, which makes use of different resources, yields better results than the monolingual approach for 11 languages. Our system ranked 5th and achieved an overall LAS score of 70.93 over the 81 test corpora (macro-averaged LAS F1 score).

2016

More than Word Cooccurrence: Exploring Support and Opposition in International Climate Negotiations with Semantic Parsing
Pablo Ruiz Fabo | Clément Plancq | Thierry Poibeau
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Text analysis methods widely used in digital humanities often involve word co-occurrence, e.g. concept co-occurrence networks. These methods provide a useful corpus overview, but cannot determine the predicates that relate co-occurring concepts. Our goal was to identify propositions expressing the points supported or opposed by participants in international climate negotiations. Word co-occurrence methods were not sufficient, and an analysis based on open relation extraction had limited coverage for nominal predicates. We present a pipeline which identifies the points that different actors support and oppose, via a domain model with support/opposition predicates, and analysis rules that exploit the output of semantic role labelling, syntactic dependencies and anaphora resolution. Entity linking and keyphrase extraction are also performed on the propositions related to each actor. A user interface allows examining the main concepts in points supported or opposed by each participant, which participants agree or disagree with each other, and about which issues. The system is an example of tools that digital humanities scholars are asking for, to render rich textual information (beyond word co-occurrence) more amenable to quantitative treatment. An evaluation of the tool was satisfactory.

Proceedings of the 7th Workshop on Cognitive Aspects of Computational Language Learning
Anna Korhonen | Alessandro Lenci | Brian Murphy | Thierry Poibeau | Aline Villavicencio
Proceedings of the 7th Workshop on Cognitive Aspects of Computational Language Learning

The Role of Intrinsic Motivation in Artificial Language Emergence: a Case Study on Colour
Miquel Cornudella | Thierry Poibeau | Remi van Trijp
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Human languages have multiple strategies that allow us to discriminate objects in a vast variety of contexts. Colours have been extensively studied from this point of view. In particular, previous research in artificial language evolution has shown how artificial languages may emerge based on specific strategies to distinguish colours. Still, it has not been shown how several strategies of diverse complexity can be autonomously managed by artificial agents. We propose an intrinsic motivation system that allows agents in a population to create a shared artificial language and progressively increase its expressive power. Our results show that with such a system agents successfully regulate their language development, which indicates a relation between population size and consistency in the emergent communicative systems.

Exploring a Continuous and Flexible Representation of the Lexicon
Pierre Marchal | Thierry Poibeau
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

We aim to show that lexical descriptions based on multifactorial and continuous models can be used by linguists and lexicographers (and not only by machines) so long as they are provided with a way to efficiently navigate data collections. We propose to demonstrate such a system.

2015

Proceedings of the Sixth Workshop on Cognitive Aspects of Computational Language Learning
Robert Berwick | Anna Korhonen | Alessandro Lenci | Thierry Poibeau | Aline Villavicencio
Proceedings of the Sixth Workshop on Cognitive Aspects of Computational Language Learning

Language Emergence in a Population of Artificial Agents Equipped with the Autotelic Principle
Miquel Cornudella | Thierry Poibeau
Proceedings of the Sixth Workshop on Cognitive Aspects of Computational Language Learning

ELCO3: Entity Linking with Corpus Coherence Combining Open Source Annotators
Pablo Ruiz | Thierry Poibeau | Frédérique Mélanie
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations

Combining Open Source Annotators for Entity Linking through Weighted Voting
Pablo Ruiz | Thierry Poibeau
Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics

EL92: Entity Linking Combining Open Source Annotators via Weighted Voting
Pablo Ruiz | Thierry Poibeau
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

2014

Mapping the Natural Language Processing Domain: Experiments using the ACL Anthology
Elisa Omodei | Jean-Philippe Cointet | Thierry Poibeau
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper investigates the evolution of the computational linguistics domain through a quantitative analysis of the ACL Anthology (containing around 12,000 papers published between 1985 and 2008). Our approach combines complex system methods with natural language processing techniques. We reconstruct the socio-semantic landscape of the domain by inferring a co-authorship and a semantic network from the analysis of the corpus. First, keywords are extracted using a hybrid approach mixing linguistic patterns with statistical information. Then, the semantic network is built using a co-occurrence analysis of these keywords within the corpus. Combining temporal and network analysis techniques, we are able to examine the main evolutions of the field and the more active subfields over time. Lastly we propose a model to explore the mutual influence of the social and the semantic network over time, leading to a socio-semantic co-evolutionary system.

Argumentative analysis of the ACL Anthology (Analyse argumentative du corpus de l’ACL (ACL Anthology)) [in French]
Elisa Omodei | Yufan Guo | Jean-Philippe Cointet | Thierry Poibeau
Proceedings of TALN 2014 (Volume 2: Short Papers)

Proceedings of the 5th Workshop on Cognitive Aspects of Computational Language Learning (CogACLL)
Alessandro Lenci | Muntsa Padró | Thierry Poibeau | Aline Villavicencio
Proceedings of the 5th Workshop on Cognitive Aspects of Computational Language Learning (CogACLL)

Social and Semantic Diversity: Socio-semantic Representation of a Scientific Corpus
Thierry Poibeau | Elisa Omodei | Jean-Philippe Cointet | Yufan Guo
Proceedings of the 8th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)

Processing Mutations in Breton with Finite-State Transducers
Thierry Poibeau
Proceedings of the First Celtic Language Technology Workshop

Introduction: Cognitive Issues in Natural Language Processing
Thierry Poibeau | Shravan Vasishth
Traitement Automatique des Langues, Volume 55, Numéro 3 : Traitement automatique du langage naturel et sciences cognitives [Natural Language Processing and Cognitive Sciences]

2013

A Tensor-based Factorization Model of Semantic Compositionality
Tim Van de Cruys | Thierry Poibeau | Anna Korhonen
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2012

Multi-way Tensor Factorization for Unsupervised Lexical Acquisition
Tim Van de Cruys | Laura Rimell | Thierry Poibeau | Anna Korhonen
Proceedings of COLING 2012

Proceedings of the Workshop on Computational Models of Language Acquisition and Loss
Robert Berwick | Anna Korhonen | Thierry Poibeau | Aline Villavicencio
Proceedings of the Workshop on Computational Models of Language Acquisition and Loss

ANALEC: a New Tool for the Dynamic Annotation of Textual Data
Frédéric Landragin | Thierry Poibeau | Bernard Victorri
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We introduce ANALEC, a tool whose aim is to bring together corpus annotation, visualization and query management. Our main idea is to provide a unified and dynamic way of annotating textual data. ANALEC allows researchers to dynamically build their own annotation scheme and use the possibilities of scheme revision, data querying and graphical visualization during the annotation process. Each query result can be visualized using a graphical representation that puts forward a set of annotations that can be directly corrected or completed. Text annotation is then considered as a cyclic process. We show that statistics like frequencies and correlations make it possible to verify annotated data on the fly during the annotation. In this paper we introduce the annotation functionalities of ANALEC, some of the annotated data visualization functionalities, and three statistical modules: frequency, correlation and geometrical representations. Some examples dealing with reference and coreference annotation illustrate the main contributions of ANALEC.

The aim of this paper is to assess to what extent the “syntactic functions” annotated in part of the Paris 7 treebank can be learned from examples. The machine learning technique used for this relies on Conditional Random Fields (CRFs), in a variant adapted to tree annotation. The experiments are described in detail and analyzed. With suitable parameter settings, they reach an F1-measure of over 80%.

Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Prise de position
Adeline Nazarenko | Thierry Poibeau
Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Prise de position

Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Articles courts
Adeline Nazarenko | Thierry Poibeau
Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Démonstrations
Adeline Nazarenko | Thierry Poibeau
Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Démonstrations

CBSEAS, a Summarization System – Integration of Opinion Mining Techniques to Summarize Blogs
Aurélien Bossard | Michel Généreux | Thierry Poibeau
Proceedings of the Demonstrations Session at EACL 2009

2008

Do we Still Need Gold Standards for Evaluation?
Thierry Poibeau | Cédric Messiant
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The availability of a huge mass of textual data in electronic format has increased the need for fast and accurate techniques for textual data processing. Machine learning and statistical approaches have been increasingly used in NLP for a decade, mainly because they are quick, versatile and efficient. However, despite this evolution of the field, evaluation still relies (most of the time) on a comparison between the output of a probabilistic or statistical system on the one hand, and a non-statistical, most of the time hand-crafted, gold standard on the other hand. In this paper, we take the acquisition of subcategorization frames from corpora as a practical example. Our study is motivated by the fact that, even if a gold standard is an invaluable resource for evaluation, a gold standard is always partial and does not really show how accurate and useful results are.

LexSchem: a Large Subcategorization Lexicon for French Verbs
Cédric Messiant | Thierry Poibeau | Anna Korhonen
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper presents LexSchem - the first large, fully automatically acquired subcategorization lexicon for French verbs. The lexicon includes subcategorization frame and frequency information for 3297 French verbs. When evaluated on a set of 20 test verbs against a gold standard dictionary, it shows 0.79 precision, 0.55 recall and 0.65 F-measure. We have made this resource freely available to the research community on the web.

Regroupement automatique de documents en classes événementielles [Automatic clustering of documents into event classes]
Aurélien Bossard | Thierry Poibeau
Actes de la 15ème conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

This paper addresses the automatic clustering of documents on an event basis. After clarifying the notion of event, we examine how to represent the documents of a newswire corpus, then present an unsupervised learning approach to clustering based on k-means. Finally, we evaluate the document clustering system on a small corpus and discuss the quantitative evaluation of this type of task.

Coling 2008: Proceedings of the workshop Multi-source Multilingual Information Extraction and Summarization
Sivaji Bandyopadhyay | Thierry Poibeau | Horacio Saggion | Roman Yangarber
Coling 2008: Proceedings of the workshop Multi-source Multilingual Information Extraction and Summarization

2007

Automatically Restructuring Practice Guidelines using the GEM DTD
Amanda Bouffier | Thierry Poibeau
Biological, translational, and clinical language processing

UP13: Knowledge-poor Methods (Sometimes) Perform Poorly
Thierry Poibeau
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

2005

Sur le statut référentiel des entités nommées
Thierry Poibeau
Actes de la 12ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

We show in this paper that the same entity can be designated in multiple ways and that the names designating these entities are polysemous by nature. The analysis therefore cannot be limited to an attempt at reference resolution, but must bring out the naming possibilities, which rely essentially on two linguistic operations: synecdoche and metonymy. Finally, we present a model that makes the different designations in discourse explicit, by unifying the representation of linguistic knowledge and world knowledge.

2004

Semi-automatic Acquisition of Command Grammar
Thierry Poibeau | Bénédicte Goujon
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

Automatic extraction of paraphrastic phrases from medium-size corpora
Thierry Poibeau
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

2003

2002

Generating Extraction Patterns from a Large Semantic Network and an Untagged Corpus
Thierry Poibeau | Dominique Dutoit
COLING-02: SEMANET: Building and Using Semantic Networks

Evaluating resource acquisition tools for Information Extraction
Thierry Poibeau | Dominique Dutoit | Sophie Bizouard
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

Inferring Knowledge from a Large Semantic Network
Dominique Dutoit | Thierry Poibeau
COLING 2002: The 19th International Conference on Computational Linguistics

Évaluer l’acquisition semi-automatique de classes sémantiques
Thierry Poibeau | Dominique Dutoit | Sophie Bizouard
Actes de la 9ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

This paper evaluates two different approaches to building semantic classes. An endogenous approach (acquisition from a corpus) is contrasted with an exogenous approach (using a rich semantic network). The paper presents a fine-grained evaluation of both techniques.

2001

Extraction d’information dans les bases de données textuelles en génomique au moyen de transducteurs à nombre fini d’états
Thierry Poibeau
Actes de la 8ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

This paper describes a system for extracting information about gene interactions from large textual databases. The system is based on an analysis using finite-state transducers. The paper shows how part of the resources (interaction verbs) can be acquired semi-automatically. A detailed evaluation of the system is provided.

Extraction de noms propres à partir de textes variés: problématique et enjeux
Leila Kosseim | Thierry Poibeau
Actes de la 8ème conférence sur le Traitement Automatique des Langues Naturelles. Posters

This paper deals with the identification of proper names in written texts. Rule-based strategies developed for newspaper-style texts generally prove insufficient for corpora made up of texts that do not follow strict editorial standards. After a brief review of work on corpora of journalistic texts, we present the issues raised by the analysis of varied texts, drawing on two corpora made up of e-mails and manual transcriptions of telephone conversations. Once the sources of error have been presented, we describe the approach used to adapt a proper-name extraction system developed for journalistic texts to the analysis of e-mail messages.

Intex et ses applications informatiques
Max Silberztein | Thierry Poibeau | Antonio Balvet
Actes de la 8ème conférence sur le Traitement Automatique des Langues Naturelles. Tutoriels

Intex is a development environment used to quickly build, test and accumulate morpho-syntactic patterns that occur in texts written in natural language. An overview of the system is given in [Silberztein, 1999], and the instruction manual is available [Silberztein 2000]. Each elementary description is represented by a local grammar, which is usually entered using the Intex graph editor. An important feature of Intex is that each local grammar can easily be reused in other local grammars. Typically, developers build elementary graphs that are equivalent to finite-state transducers, and reuse these graphs in other, increasingly complex graphs. A second feature of Intex is that the objects it processes (grammars, dictionaries and texts) are represented internally as finite-state transducers. As a result, all the functionalities of the system reduce to a limited number of operations on transducers. For example, applying a grammar to a text amounts to building the union of the elementary transducers, determinizing it, and then computing the intersection of the result with the transducer of the text. This architecture makes it possible to use efficient algorithms (e.g. when applying a deterministic transducer to a previously indexed text), and gives Intex the power of a Turing machine (thanks to the possibility of applying transducers in cascade). In this tutorial, we will show how to use a linguistic tool such as Intex in software environments. We will draw on information filtering and information extraction applications, developed in particular at the Thales research center. The following applications will be detailed, from both a linguistic and a computational standpoint: information filtering from an AFP news feed [Meunier et al. 1999], and extraction of gene interaction tables from textual databases in genomics [Poibeau 2001]. The tutorial will show how Intex can be used as a filtering engine for an AFP-style news feed in an industrial setting. It will also detail the text transformation (transduction) functionalities that make it possible to go quickly from varied linguistic structures to normalized forms suitable for filling a database. On the software side, we will detail how to call the Intex routines and the possible settings (sentence splitting, choice of dictionaries, etc.), and we will give an overview of the new integration possibilities (Intex API).
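
The union, determinization and application steps mentioned in the abstract can be illustrated with a minimal sketch. It is not Intex code and does not use the Intex API: it assumes plain acceptors over word tokens instead of transducers, represents each hypothetical "local grammar" as a linear automaton, and replaces the intersection with the text automaton by a scan of the deterministic automaton from every token position. All function names and the sample sentence are illustrative assumptions.

```python
def word_nfa(words, offset):
    """Linear automaton accepting exactly `words`; states are ints starting at `offset`."""
    trans = {(offset + i, w): {offset + i + 1} for i, w in enumerate(words)}
    return {"starts": {offset}, "finals": {offset + len(words)}, "trans": trans}


def union(a, b):
    """Union of two automata whose state sets are assumed to be disjoint."""
    trans = {}
    for table in (a["trans"], b["trans"]):
        for key, dests in table.items():
            trans.setdefault(key, set()).update(dests)
    return {"starts": a["starts"] | b["starts"],
            "finals": a["finals"] | b["finals"],
            "trans": trans}


def determinize(nfa):
    """Subset construction: each deterministic state is a frozenset of original states."""
    start = frozenset(nfa["starts"])
    dfa_trans, finals, todo, seen = {}, set(), [start], {start}
    while todo:
        cur = todo.pop()
        if cur & nfa["finals"]:
            finals.add(cur)
        by_symbol = {}
        for (state, sym), dests in nfa["trans"].items():
            if state in cur:
                by_symbol.setdefault(sym, set()).update(dests)
        for sym, dests in by_symbol.items():
            nxt = frozenset(dests)
            dfa_trans[(cur, sym)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return {"start": start, "finals": finals, "trans": dfa_trans}


def find_matches(dfa, tokens):
    """Return (begin, end) token spans recognized by the automaton, scanning every position."""
    matches = []
    for i in range(len(tokens)):
        state = dfa["start"]
        for j in range(i, len(tokens)):
            state = dfa["trans"].get((state, tokens[j]))
            if state is None:
                break
            if state in dfa["finals"]:
                matches.append((i, j + 1))
    return matches


# Two toy patterns standing in for interaction-verb grammars (illustrative only).
grammar = determinize(union(word_nfa(["activates"], 0), word_nfa(["binds", "to"], 10)))
tokens = "geneA activates geneB and binds to geneC".split()
print(find_matches(grammar, tokens))  # [(1, 2), (4, 6)]
```

A real transducer would also emit output symbols on each transition; the acceptor version only shows how several small grammars can be merged into one deterministic machine before being applied to a text.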