Benoît Sagot

Also published as: Benoit Sagot

2021

First Align, then Predict: Understanding the Cross-Lingual Ability of Multilingual BERT
Benjamin Muller | Yanai Elazar | Benoît Sagot | Djamé Seddah
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Multilingual pretrained language models have demonstrated remarkable zero-shot cross-lingual transfer capabilities. Such transfer emerges by fine-tuning on a task of interest in one language and evaluating on a distinct language, not seen during the fine-tuning. Despite promising results, we still lack a proper understanding of the source of this transfer. Using a novel layer ablation technique and analyses of the model’s internal representations, we show that multilingual BERT, a popular multilingual language model, can be viewed as the stacking of two sub-networks: a multilingual encoder followed by a task-specific language-agnostic predictor. While the encoder is crucial for cross-lingual transfer and remains mostly unchanged during fine-tuning, the task predictor has little importance on the transfer and can be reinitialized during fine-tuning. We present extensive experiments with three distinct tasks, seventeen typologically diverse languages and multiple domains to support our hypothesis.

Can Cognate Prediction Be Modelled as a Low-Resource Machine Translation Task?
Clémentine Fourrier | Rachel Bawden | Benoît Sagot
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Can Character-based Language Models Improve Downstream Task Performances In Low-Resource And Noisy Language Scenarios?
Arij Riabi | Benoît Sagot | Djamé Seddah
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)

Recent impressive improvements in NLP, largely based on the success of contextual neural language models, have been mostly demonstrated on at most a couple dozen high-resource languages. Building language models and, more generally, NLP systems for non-standardized and low-resource languages remains a challenging task. In this work, we focus on North-African colloquial dialectal Arabic written using an extension of the Latin script, called NArabizi, found mostly on social media and messaging communication. In this low-resource scenario with data displaying a high level of variability, we compare the downstream performance of a character-based language model on part-of-speech tagging and dependency parsing to that of monolingual and multilingual models. We show that a character-based model trained on only 99k sentences of NArabizi and fine-tuned on a small treebank of this language leads to performance close to that obtained with the same architecture pre-trained on large multilingual and monolingual models. Confirming these results on a much larger data set of noisy French user-generated content, we argue that such character-based language models can be an asset for NLP in low-resource and high language variability settings.

When Being Unseen from mBERT is just the Beginning: Handling New Languages With Multilingual Language Models
Benjamin Muller | Antonios Anastasopoulos | Benoît Sagot | Djamé Seddah
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Transfer learning based on pretraining language models on a large amount of raw data has become a new norm to reach state-of-the-art performance in NLP. Still, it remains unclear how this approach should be applied for unseen languages that are not covered by any available large-scale multilingual language model and for which only a small amount of raw data is generally available. In this work, by comparing multilingual and monolingual models, we show that such models behave in multiple ways on unseen languages. Some languages greatly benefit from transfer learning and behave similarly to closely related high resource languages whereas others apparently do not. Focusing on the latter, we show that this failure to transfer is largely related to the impact of the script used to write such languages. We show that transliterating those languages significantly improves the potential of large-scale multilingual language models on downstream tasks. This result provides a promising direction towards making these massively multilingual models useful for a new set of unseen languages.

Synthetic Data Augmentation for Zero-Shot Cross-Lingual Question Answering
Arij Riabi | Thomas Scialom | Rachel Keraron | Benoît Sagot | Djamé Seddah | Jacopo Staiano
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Coupled with the availability of large-scale datasets, deep learning architectures have enabled rapid progress on the Question Answering task. However, most of those datasets are in English, and the performance of state-of-the-art multilingual models is significantly lower when evaluated on non-English data. Due to high data collection costs, it is not realistic to obtain annotated data for each language one desires to support. We propose a method to improve Cross-lingual Question Answering performance without requiring additional annotated data, leveraging Question Generation models to produce synthetic samples in a cross-lingual fashion. We show that the proposed method significantly outperforms the baselines trained on English data only. We report a new state of the art on four datasets: MLQA, XQuAD, SQuAD-it and PIAF (fr).

2020

Les modèles de langue contextuels Camembert pour le français : impact de la taille et de l’hétérogénéité des données d’entrainement (CamemBERT Contextual Language Models for French: Impact of Training Data Size and Heterogeneity)
Louis Martin | Benjamin Muller | Pedro Javier Ortiz Suárez | Yoann Dupont | Laurent Romary | Éric Villemonte de la Clergerie | Benoît Sagot | Djamé Seddah
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 2 : Traitement Automatique des Langues Naturelles

Contextual neural language models are now ubiquitous in natural language processing. Until recently, most available models were trained either on English data or on the concatenation of data in several languages. The practical use of these models, in all languages except English, was therefore limited. The recent release of several monolingual BERT-based models (Devlin et al., 2019), notably for French, has demonstrated the value of such models by improving the state of the art on all evaluated tasks. In this paper, based on experiments with CamemBERT (Martin et al., 2019), we show that using highly variable data is preferable to using more uniform data. More surprisingly, we show that a relatively small set of web-crawled data (4 GB) gives results as good as those obtained from datasets two orders of magnitude larger (138 GB).

Comparing Statistical and Neural Models for Learning Sound Correspondences
Clémentine Fourrier | Benoît Sagot
Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages

Cognate prediction and proto-form reconstruction are key tasks in computational historical linguistics that rely on the study of sound change regularity. Solving these tasks appears to be very similar to machine translation, though methods from that field have barely been applied to historical linguistics. Therefore, in this paper, we investigate the learnability of sound correspondences between a proto-language and daughter languages for two machine-translation-inspired models, one statistical, the other neural. We first carry out our experiments on plausible artificial languages, without noise, in order to study the role of each parameter on the algorithms' respective performance under almost perfect conditions. We then study real languages, namely Latin, Italian and Spanish, to see if those performances generalise well. We show that both model types manage to learn sound changes despite data scarcity, although the best performing model type depends on several parameters such as the size of the training data, the ambiguity, and the prediction direction.

French Contextualized Word-Embeddings with a sip of CaBeRnet: a New French Balanced Reference Corpus
Murielle Popa-Fabre | Pedro Javier Ortiz Suárez | Benoît Sagot | Éric de la Clergerie
Proceedings of the 8th Workshop on Challenges in the Management of Large Corpora

This paper investigates the impact of different types and sizes of training corpora on language models. By asking the fundamental question of quality versus quantity, we compare four French corpora by pre-training four different ELMos and evaluating them on dependency parsing, POS-tagging and Named Entity Recognition downstream tasks. We present and assess the relevance of a new balanced French corpus, CaBeRnet, that features a representative range of language usage, including a balanced variety of genres (oral transcriptions, newspapers, popular magazines, technical reports, fiction, academic texts), in oral and written styles. We hypothesize that a linguistically representative corpus will allow the language models to be more efficient, and therefore yield better evaluation scores on different evaluation sets and tasks. This paper offers three main contributions: (1) two newly built corpora: (a) CaBeRnet, a French Balanced Reference Corpus and (b) CBT-fr, a domain-specific corpus having both oral and written style in youth literature, (2) five versions of ELMo pre-trained on differently built corpora, and (3) a whole array of computational results on downstream tasks that deepen our understanding of the effects of corpus balance and register in NLP evaluation.

Methodological Aspects of Developing and Managing an Etymological Lexical Resource: Introducing EtymDB-2.0
Clémentine Fourrier | Benoît Sagot
Proceedings of the 12th Language Resources and Evaluation Conference

Diachronic lexical information is not only important in the field of historical linguistics, but is also increasingly used in NLP, most recently for machine translation of low resource languages. Therefore, there is a need for fine-grained, large-coverage and accurate etymological lexical resources. In this paper, we propose a set of guidelines to generate such resources, for each step of the life-cycle of an etymological lexicon: creation, update, evaluation, dissemination, and exploitation. To illustrate the guidelines, we introduce EtymDB 2.0, an etymological database automatically generated from the Wiktionary, which contains 1.8 million lexemes, linked by more than 700,000 fine-grained etymological relations, across 2,536 living and dead languages. We also introduce use cases for which EtymDB 2.0 could represent a key resource, such as phylogenetic tree generation, low resource machine translation or medieval languages study.

OFrLex: A Computational Morphological and Syntactic Lexicon for Old French
Gaël Guibon | Benoît Sagot
Proceedings of the 12th Language Resources and Evaluation Conference

In this paper we describe our work on the development and enrichment of OFrLex, a freely available, large-coverage morphological and syntactic Old French lexicon. We rely on several heterogeneous language resources to extract structured and exploitable information. The extraction follows a semi-automatic procedure with substantial manual steps to respond to difficulties encountered while aligning lexical entries from distinct language resources. OFrLex aims at improving natural language processing tasks on Old French such as part-of-speech tagging and dependency parsing. We provide quantitative information on OFrLex and discuss its reliability. We also describe and evaluate a semi-automatic, word-embedding-based lexical enrichment process aimed at increasing the accuracy of the resource. Results of this extension technique will be manually validated in the near future, a step that will take advantage of OFrLex’s viewing, searching and editing interface, which is already accessible online.

Establishing a New State-of-the-Art for French Named Entity Recognition
Pedro Javier Ortiz Suárez | Yoann Dupont | Benjamin Muller | Laurent Romary | Benoît Sagot
Proceedings of the 12th Language Resources and Evaluation Conference

The French TreeBank developed at the University Paris 7 is the main source of morphosyntactic and syntactic annotations for French. However, it does not include explicit information related to named entities, which are among the most useful information for several natural language processing tasks and applications. Moreover, no large-scale French corpus with named entity annotations contains referential information, which complements the type and the span of each mention with an indication of the entity it refers to. We have manually annotated the French TreeBank with such information, after an automatic pre-annotation step. We sketch the underlying annotation guidelines and we provide a few figures about the resulting annotations.

Controllable Sentence Simplification
Louis Martin | Éric de la Clergerie | Benoît Sagot | Antoine Bordes
Proceedings of the 12th Language Resources and Evaluation Conference

Text simplification aims at making a text easier to read and understand by simplifying grammar and structure while keeping the underlying information identical. It is often considered an all-purpose generic task where the same simplification is suitable for all; however multiple audiences can benefit from simplified text in different ways. We adapt a discrete parametrization mechanism that provides explicit control on simplification systems based on Sequence-to-Sequence models. As a result, users can condition the simplifications returned by a model on attributes such as length, amount of paraphrasing, lexical complexity and syntactic complexity. We also show that carefully chosen values of these attributes allow out-of-the-box Sequence-to-Sequence models to outperform their standard counterparts on simplification benchmarks. Our model, which we call ACCESS (as shorthand for AudienCe-CEntric Sentence Simplification), establishes the state of the art at 41.87 SARI on the WikiLarge test set, a +1.42 improvement over the best previously reported score.
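The idea of conditioning a Sequence-to-Sequence model on a discrete attribute such as length can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the token format `<LENGTH_…>`, the bucket width `step`, and both function names are assumptions made for this example. At training time the token is computed from a reference pair; at inference time the user picks the value.

```python
def length_ratio_token(source: str, target: str, step: float = 0.05) -> str:
    """Build a discrete control token from the target/source character-length ratio.

    The token format and the bucket width are hypothetical choices for this sketch.
    """
    ratio = len(target) / len(source)
    # Snap the ratio to the nearest bucket so the token vocabulary stays small.
    bucket = round(ratio / step) * step
    return f"<LENGTH_{bucket:.2f}>"


def add_control_token(source: str, token: str) -> str:
    """Prepend the control token to the source sentence fed to the model."""
    return f"{token} {source}"


if __name__ == "__main__":
    src = "The committee reached a unanimous decision after lengthy deliberation."
    ref = "The committee agreed after a long talk."
    token = length_ratio_token(src, ref)      # training time: derived from the pair
    print(add_control_token(src, token))
    print(add_control_token(src, "<LENGTH_0.80>"))  # inference time: chosen by the user
```

Other attributes named in the abstract (amount of paraphrasing, lexical complexity, syntactic complexity) would each need their own measure; only length is simple enough to compute directly from the strings.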

We introduce the first treebank for a romanized user-generated content variety of Algerian, a North-African Arabic dialect known for its frequent usage of code-switching. Made of 1500 sentences, fully annotated in morpho-syntax and Universal Dependency syntax, with full translation at both the word and the sentence levels, this treebank is made freely available. It is supplemented with 50k unlabeled sentences collected from Common Crawl and web-crawled data using intensive data-mining techniques. Preliminary experiments demonstrate its usefulness for POS tagging and dependency parsing. We believe that what we present in this paper is useful beyond the low-resource language community. This is the first time that enough unlabeled and annotated data is provided for an emerging user-generated content dialectal language with rich morphology and code switching, making it a challenging test-bed for most recent NLP approaches.

A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages
Pedro Javier Ortiz Suárez | Laurent Romary | Benoît Sagot
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We use the multilingual OSCAR corpus, extracted from Common Crawl via language classification, filtering and cleaning, to train monolingual contextualized word embeddings (ELMo) for five mid-resource languages. We then compare the performance of OSCAR-based and Wikipedia-based ELMo embeddings for these languages on the part-of-speech tagging and parsing tasks. We show that, despite the noise in the Common-Crawl-based OSCAR data, embeddings trained on OSCAR perform much better than monolingual embeddings trained on Wikipedia. They actually equal or improve the current state of the art in tagging and parsing for all five languages. In particular, they also improve over multilingual Wikipedia-based contextual embeddings (multilingual BERT), which almost always constitutes the previous state of the art, thereby showing that the benefit of a larger, more diverse corpus surpasses the cross-lingual benefit of multilingual embedding architectures.

ASSET: A Dataset for Tuning and Evaluation of Sentence Simplification Models with Multiple Rewriting Transformations
Fernando Alva-Manchego | Louis Martin | Antoine Bordes | Carolina Scarton | Benoît Sagot | Lucia Specia
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

In order to simplify a sentence, human editors perform multiple rewriting transformations: they split it into several shorter sentences, paraphrase words (i.e. replacing complex words or phrases by simpler synonyms), reorder components, and/or delete information deemed unnecessary. Despite this varied range of possible text alterations, current models for automatic sentence simplification are evaluated using datasets that are focused on a single transformation, such as lexical paraphrasing or splitting. This makes it impossible to understand the ability of simplification models in more realistic settings. To alleviate this limitation, this paper introduces ASSET, a new dataset for assessing sentence simplification in English. ASSET is a crowdsourced multi-reference corpus where each simplification was produced by executing several rewriting transformations. Through quantitative and qualitative experiments, we show that simplifications in ASSET are better at capturing characteristics of simplicity when compared to other standard evaluation datasets for the task. Furthermore, we motivate the need for developing better methods for automatic evaluation using ASSET, since we show that current popular metrics may not be suitable when multiple simplification transformations are performed.

Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available models have either been trained on English data or on the concatenation of data in multiple languages. This makes practical use of such models, in all languages except English, very limited. In this paper, we investigate the feasibility of training monolingual Transformer-based language models for other languages, taking French as an example and evaluating our language models on part-of-speech tagging, dependency parsing, named entity recognition and natural language inference tasks. We show that the use of web crawled data is preferable to the use of Wikipedia data. More surprisingly, we show that a relatively small web crawled dataset (4GB) leads to results that are as good as those obtained using larger datasets (130+GB). Our best performing model CamemBERT reaches or improves the state of the art in all four downstream tasks.

2019

What Does BERT Learn about the Structure of Language?
Ganesh Jawahar | Benoît Sagot | Djamé Seddah
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

BERT is a recent language representation model that has performed surprisingly well in diverse language understanding benchmarks. This result indicates the possibility that BERT networks capture structural information about language. In this work, we provide novel support for this claim by performing a series of experiments to unpack the elements of English language structure learned by BERT. Our findings are fourfold. BERT’s phrasal representation captures the phrase-level information in the lower layers. The intermediate layers of BERT compose a rich hierarchy of linguistic information, starting with surface features at the bottom, syntactic features in the middle followed by semantic features at the top. BERT requires deeper layers while tracking subject-verb agreement to handle the long-term dependency problem. Finally, the compositional scheme underlying BERT mimics classical, tree-like structures.

Développement d’un lexique morphologique et syntaxique de l’ancien français (Development of a morphological and syntactic lexicon of Old French)
Benoît Sagot
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume II : Articles courts

In this paper, we describe our work on developing a large-scale morphological and syntactic lexicon of Old French for natural language processing. We relied on dictionary and lexical resources from which extracting structured and exploitable information required dedicated developments. In addition, aligning information from these different sources raised difficulties. We give some quantitative indications about the resulting lexicon, and discuss its reliability in its current version and the prospects for improvement opened up by the existence of a first version, notably through the automatic analysis of textual data.

Enhancing BERT for Lexical Normalization
Benjamin Muller | Benoit Sagot | Djamé Seddah
Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)

Language model-based pre-trained representations have become ubiquitous in natural language processing. They have been shown to significantly improve the performance of neural models on a great variety of tasks. However, it remains unclear how useful those general models can be in handling non-canonical text. In this article, focusing on User Generated Content (UGC), we study the ability of BERT to perform lexical normalisation. Our contribution is simple: by framing lexical normalisation as a token prediction task, by enhancing its architecture and by carefully fine-tuning it, we show that BERT can be a competitive lexical normalisation model without the need of any UGC resources aside from 3,000 training sentences. To the best of our knowledge, it is the first work done in adapting and analysing the ability of this model to handle noisy UGC data.

2018

ELMoLex: Connecting ELMo and Lexicon Features for Dependency Parsing
Ganesh Jawahar | Benjamin Muller | Amal Fethi | Louis Martin | Éric Villemonte de la Clergerie | Benoît Sagot | Djamé Seddah
Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

In this paper, we present the details of the neural dependency parser and the neural tagger submitted by our team ‘ParisNLP’ to the CoNLL 2018 Shared Task on parsing from raw text to Universal Dependencies. We augment the deep Biaffine (BiAF) parser (Dozat and Manning, 2016) with novel features to perform competitively: we utilize an in-domain version of ELMo features (Peters et al., 2018) which provide context-dependent word representations; we utilize disambiguated, embedded, morphosyntactic features from lexicons (Sagot, 2018), which complement the existing feature set. Henceforth, we call our system ‘ELMoLex’. In addition to incorporating character embeddings, ELMoLex benefits from pre-trained word vectors, ELMo and morphosyntactic features (whenever available) to correctly handle rare or unknown words which are prevalent in languages with complex morphology. ELMoLex ranked 11th by Labeled Attachment Score metric (70.64%), Morphology-aware LAS metric (55.74%) and ranked 9th by Bilexical dependency metric (60.70%).

A multilingual collection of CoNLL-U-compatible morphological lexicons
Benoît Sagot
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Cheating a Parser to Death: Data-driven Cross-Treebank Annotation Transfer
Djamé Seddah | Eric de la Clergerie | Benoît Sagot | Héctor Martínez Alonso | Marie Candito
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Construction automatique d’une base de données étymologiques à partir du wiktionary (Automatic construction of an etymological database using Wiktionary)
Benoît Sagot
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 - Articles longs

Electronic lexical resources almost never contain etymological information. Yet such information, suitably formalized, would make it possible to develop automatic tools in support of historical and comparative linguistics, and to significantly improve the automatic processing of ancient languages. We describe here the process we implemented to extract etymological data from the etymological notices of Wiktionary, written in English. We thereby produced a multilingual database of nearly one million lexemes and a database of more than half a million etymological relations between lexemes.

Annotating omission in statement pairs
Héctor Martínez Alonso | Amaury Delamaire | Benoît Sagot
Proceedings of the 11th Linguistic Annotation Workshop

We focus on the identification of omission in statement pairs. We compare three annotation schemes, namely two different crowdsourcing schemes and manual expert annotation. We show that the simpler of the two crowdsourcing approaches yields a better annotation quality than the more complex one. We use a dedicated classifier to assess whether the annotators’ behavior can be explained by straightforward linguistic features. The classifier benefits from a modeling that uses lexical information beyond length and overlap measures. However, for our task, we argue that expert and not crowdsourcing-based annotation is the best compromise between annotation cost and quality.

Speeding up corpus development for linguistic research: language documentation and acquisition in Romansh Tuatschin
Géraldine Walther | Benoît Sagot
Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

In this paper, we present ongoing work for developing language resources and basic NLP tools for an undocumented variety of Romansh, in the context of a language documentation and language acquisition project. Our tools are meant to improve the speed and reliability of corpus annotations for noisy data involving large amounts of code-switching, occurrences of child-speech and orthographic noise. Being able to increase the efficiency of language resource development for language documentation and acquisition research also constitutes a step towards solving the data sparsity issues with which researchers have been struggling.

Improving neural tagging with lexical information
Benoît Sagot | Héctor Martínez Alonso
Proceedings of the 15th International Conference on Parsing Technologies

Neural part-of-speech tagging has achieved competitive results with the incorporation of character-based and pre-trained word embeddings. In this paper, we show that a state-of-the-art bi-LSTM tagger can benefit from using information from morphosyntactic lexicons as additional input. The tagger, trained on several dozen languages, shows a consistent, average improvement when using lexical information, even when also using character-based embeddings, thus showing the complementarity of the different sources of lexical information. The improvements are particularly important for the smaller datasets.

The ParisNLP entry at the ConLL UD Shared Task 2017: A Tale of a #ParsingTragedy
Éric de La Clergerie | Benoît Sagot | Djamé Seddah
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

We present the ParisNLP entry at the UD CoNLL 2017 parsing shared task. In addition to the UDpipe models provided, we built our own data-driven tokenization models, sentence segmenter and lexicon-based morphological analyzers. All of these were used with a range of different parsing models (neural or not, feature-rich or not, transition or graph-based, etc.) and the best combination for each language was selected. Unfortunately, a glitch in the shared task’s Matrix led our model selector to run generic, weakly lexicalized models, tailored for surprise languages, instead of our dataset-specific models. Because of this #ParsingTragedy, we officially ranked 27th, whereas our real models finally unofficially ranked 6th.

2016

From Noisy Questions to Minecraft Texts: Annotation Challenges in Extreme Syntax Scenario
Héctor Martínez Alonso | Djamé Seddah | Benoît Sagot
Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)

User-generated content presents many challenges for its automatic processing. While many of them do come from out-of-vocabulary effects, others spawn from different linguistic phenomena such as unusual syntax. In this work we present a French three-domain data set made up of question headlines from a cooking forum, game chat logs and associated forums from two popular online games (MINECRAFT & LEAGUE OF LEGENDS). We chose these domains because they encompass different degrees of lexical and syntactic compliance with canonical language. We conduct an automatic and manual evaluation of the difficulties of processing these domains for part-of-speech prediction, and introduce a pilot study to determine whether dependency analysis lends itself well to annotate these data. We also discuss the development cost of our data set.

Étiquetage multilingue en parties du discours avec MElt (Multilingual part-of-speech tagging with MElt)
Benoît Sagot
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Posters)

We present recent work on MElt, a discriminative part-of-speech tagging system. MElt emphasizes the optimal exploitation of external lexical information to improve tagger performance over models trained only on annotated corpora. We trained MElt on more than forty datasets covering more than thirty languages. Compared to the state-of-the-art system MarMoT, MElt obtains slightly worse results on average in the absence of an external lexicon, but better results when such resources are available, thus producing state-of-the-art taggers for several languages.

2014

DeLex, a freely-avaible, large-scale and linguistically grounded morphological lexicon for German
Benoît Sagot
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We introduce DeLex, a freely available, large-scale and linguistically grounded morphological lexicon for German developed within the Alexina framework. We extracted lexical information from the German wiktionary and developed a morphological inflection grammar for German, based on a linguistically sound model of inflectional morphology. Although the development of DeLex involved some manual work, we show that it represents a good tradeoff between development cost, lexical coverage and resource accuracy.

A Language-independent Approach to Extracting Derivational Relations from an Inflectional Lexicon
Marion Baranes | Benoît Sagot
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper, we describe and evaluate an unsupervised method for acquiring pairs of lexical entries belonging to the same morphological family, i.e., derivationally related words, starting from a purely inflectional lexicon. Our approach relies on transformation rules that relate lexical entries with one another, and which are automatically extracted from the inflectional lexicon based on surface form analogies and on part-of-speech information. It is generic enough to be applied to any language with a mainly concatenative derivational morphology. Results were obtained and evaluated on English, French, German and Spanish. Precision results are satisfying, and our French results compare favorably with another resource, although its construction relied on manually developed lexicographic information whereas our approach only requires an inflectional lexicon.

The Asfalda project aims to develop a French corpus with frame-based semantic annotations and automatic tools for shallow semantic analysis. We present the first part of the project: focusing on a set of notional domains, we delimited a subset of English frames, adapted them to French data when necessary, and developed the corresponding French lexicon. We believe that working domain by domain helped us to enforce the coherence of the resulting resource, and also has the advantage that, though the number of frames is limited (around a hundred), we obtain full coverage within a given domain.

pdf bib abs
An Open-Source Heavily Multilingual Translation Graph Extracted from Wiktionaries and Parallel Corpora
Valérie Hanoka | Benoît Sagot
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper describes YaMTG (Yet another Multilingual Translation Graph), a new open-source heavily multilingual translation database (over 664 languages represented) built using several sources, namely various wiktionaries and the OPUS parallel corpora (Tiedemann, 2009). We detail the translation extraction process for 21 wiktionary language editions, and provide an evaluation of the translations contained in YaMTG.

pdf bib abs
A language-independent and fully unsupervised approach to lexicon induction and part-of-speech tagging for closely related languages
Yves Scherrer | Benoît Sagot
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper, we describe our generic approach for transferring part-of-speech annotations from a resourced language towards an etymologically closely related non-resourced language, without using any bilingual (i.e., parallel) data. We first induce a translation lexicon from monolingual corpora, based on cognate detection followed by cross-lingual contextual similarity. Second, POS information is transferred from the resourced language along translation pairs to the non-resourced language and used for tagging the corpus. We evaluate our methods on three language families, consisting of five Romance languages, three Germanic languages and five Slavic languages. We obtain tagging accuracies of up to 91.6%.

pdf bib
Automated Error Detection in Digitized Cultural Heritage Documents
Kata Gábor | Benoît Sagot
Proceedings of the 8th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)

pdf bib
Analogy-based Text Normalization : the case of unknowns words (Normalisation de textes par analogie: le cas des mots inconnus) [in French]
Marion Baranes | Benoît Sagot
Proceedings of TALN 2014 (Volume 1: Long Papers)

pdf bib
Sub-categorization in ‘pour’ and lexical syntax (Sous-catégorisation en pour et syntaxe lexicale) [in French]
Benoît Sagot | Laurence Danlos | Margot Colinet
Proceedings of TALN 2014 (Volume 2: Short Papers)

pdf bib
Named Entity Recognition and Correction in OCRized Corpora (Détection et correction automatique d’entités nommées dans des corpus OCRisés) [in French]
Benoît Sagot | Kata Gábor
Proceedings of TALN 2014 (Volume 2: Short Papers)

2013

pdf bib
Enforcing Subcategorization Constraints in a Parser Using Sub-parses Recombining
Seyed Abolghasem Mirroshandel | Alexis Nasr | Benoît Sagot
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Can MDL Improve Unsupervised Chinese Word Segmentation?
Pierre Magistry | Benoît Sagot
Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing

pdf bib
Lexicon induction and part-of-speech tagging of non-resourced languages without any bilingual resources
Yves Scherrer | Benoît Sagot
Proceedings of the Workshop on Adaptation of Language Resources and Tools for Closely Related Languages and Language Variants

pdf bib
Dynamic extension of a French morphological lexicon based a text stream (Extension dynamique de lexiques morphologiques pour le français à partir d’un flux textuel) [in French]
Benoît Sagot | Damien Nouvel | Virginie Mouilleron | Marion Baranes
Proceedings of TALN 2013 (Volume 1: Long Papers)

2012

pdf bib
Unsupervized Word Segmentation: the Case for Mandarin Chinese
Pierre Magistry | Benoît Sagot
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
TCOF-POS : un corpus libre de français parlé annoté en morphosyntaxe (TCOF-POS : A Freely Available POS-Tagged Corpus of Spoken French) [in French]
Christophe Benzitoun | Karën Fort | Benoît Sagot
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 2: TALN

pdf bib
Annotation référentielle du Corpus Arboré de Paris 7 en entités nommées (Referential named entity annotation of the Paris 7 French TreeBank) [in French]
Benoît Sagot | Marion Richard | Rosa Stern
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 2: TALN

pdf bib
A Joint Named Entity Recognition and Entity Linking System
Rosa Stern | Benoît Sagot | Frédéric Béchet
Proceedings of the Workshop on Innovative Hybrid Approaches to the Processing of Textual Data

pdf bib
Population of a Knowledge Base for News Metadata from Unstructured Text and Web Data
Rosa Stern | Benoît Sagot
Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction (AKBC-WEKEX)

pdf bib
Statistical Parsing of Spanish and Data Driven Lemmatization
Joseph Le Roux | Benoît Sagot | Djamé Seddah
Proceedings of the ACL 2012 Joint Workshop on Statistical Parsing and Semantic Processing of Morphologically Rich Languages

pdf bib
Dictionary-ontology cross-enrichment
Emmanuel Eckard | Lucie Barque | Alexis Nasr | Benoît Sagot
Proceedings of the 3rd Workshop on Cognitive Aspects of the Lexicon

pdf bib
The French Social Media Bank: a Treebank of Noisy User Generated Content
Djamé Seddah | Benoit Sagot | Marie Candito | Virginie Mouilleron | Vanessa Combet
Proceedings of COLING 2012

pdf bib abs
Applying cross-lingual WSD to wordnet development
Marianna Apidianaki | Benoît Sagot
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The automatic development of semantic resources constitutes an important challenge in the NLP community. The methods used generally exploit existing large-scale resources, such as Princeton WordNet, often combined with information extracted from multilingual resources and parallel corpora. In this paper we show how Cross-Lingual Word Sense Disambiguation can be applied to wordnet development. We apply the proposed method to WOLF, a free wordnet for French still under construction, in order to fill synsets that did not contain any literal yet and increase its coverage.

pdf bib abs
Evaluating and improving syntactic lexica by plugging them within a parser
Elsa Tolone | Benoît Sagot | Éric Villemonte de La Clergerie
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present some evaluation results for four French syntactic lexica, obtained through their conversion to the Alexina format used by the Lefff lexicon, and their integration within the large-coverage TAG-based FRMG parser. The evaluations are run on two test corpora, annotated with two distinct annotation formats, namely EASy/Passage chunks and relations and CoNLL dependencies. The evaluation results provide valuable feedback about the four lexica. Moreover, when coupled with error mining techniques, they allow us to identify how these lexica might be improved.

pdf bib abs
Boosting the Coverage of a Semantic Lexicon by Automatically Extracted Event Nominalizations
Kata Gábor | Marianna Apidianaki | Benoît Sagot | Éric Villemonte de La Clergerie
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this article, we present a distributional analysis method for extracting nominalization relations from monolingual corpora. The acquisition method makes use of distributional and morphological information to select nominalization candidates. We explain how the learning is performed on a dependency annotated corpus and describe the nominalization results. Furthermore, we show how these results served to enrich an existing lexical resource, the WOLF (Wordnet Libre du Français). We present the techniques that we developed in order to integrate the new information into WOLF, based on both its structure and content. Finally, we evaluate the validity of the automatically obtained information and the correctness of its integration into the semantic resource. The method proved to be useful for boosting the coverage of WOLF and presents the advantage of filling verbal synsets, which are particularly difficult to handle due to the high level of verbal polysemy.

pdf bib abs
Aleda, a free large-scale entity database for French
Benoît Sagot | Rosa Stern
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Named entity recognition, which focuses on the identification of the span and type of named entity mentions in texts, has drawn the attention of the NLP community for a long time. However, many real-life applications need to know which real entity each mention refers to. For such a purpose, often referred to as entity resolution and linking, an inventory of entities is required in order to constitute a reference. In this paper, we describe how we extracted such a resource for French from freely available resources (the French Wikipedia and the GeoNames database). We describe the results of an intrinsic evaluation of the resulting entity database, named Aleda, as well as those of a task-based evaluation in the context of a named entity detection system. We also compare it with the NLGbAse database (Charton and Torres-Moreno, 2010), a resource with similar objectives.

pdf bib abs
Cleaning noisy wordnets
Benoît Sagot | Darja Fišer
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Automatic approaches to creating and extending wordnets, which have become very popular in the past decade, inadvertently result in noisy synsets. This is why we propose an approach to detect synset outliers in order to eliminate the noise and improve the accuracy of the developed wordnets, so that they become more useful lexico-semantic resources for natural language applications. The approach compares the words that appear in the synset and its surroundings with the contexts in which the literals in question are used, based on large monolingual corpora. By fine-tuning the outlier threshold we can influence how many outlier candidates will be eliminated. Although the proposed approach is language-independent, we test it on Slovene and French wordnets that were created automatically from bilingual resources and contain plenty of disambiguation errors. Manual evaluation of the results shows that by applying a threshold similar to the estimated error rate in the respective wordnets, 67% of the proposed outlier candidates are indeed incorrect for French and 64% for Slovene. This is a big improvement compared to the estimated overall error rates in the resources, which are 12% for French and 15% for Slovene.

pdf bib abs
Wordnet extension made simple: A multilingual lexicon-based approach using wiki resources
Valérie Hanoka | Benoît Sagot
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper, we propose a simple methodology for building or extending wordnets using easily extractable lexical knowledge from Wiktionary and Wikipedia. This method relies on a large multilingual translation/synonym graph in many languages as well as synset-aligned wordnets. It guesses frequent and polysemous literals that are difficult to find using other methods by looking at back-translations in the graph, showing that the use of a heavily multilingual lexicon can be a way to mitigate the lack of a wide-coverage bilingual lexicon for wordnet creation or extension. We evaluate our approach on French by applying it to extend WOLF, a freely available French wordnet.

2011

pdf bib abs
Un turc mécanique pour les ressources linguistiques : critique de la myriadisation du travail parcellisé (Mechanical Turk for linguistic resources: review of the crowdsourcing of parceled work)
Benoît Sagot | Karën Fort | Gilles Adda | Joseph Mariani | Bernard Lang
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

This article is a position paper on platforms such as Amazon Mechanical Turk, whose use has been growing rapidly over the past few years in natural language processing. According to the prevailing discourse in the field's papers, these online work platforms make it possible to develop all sorts of high-quality language resources, for an unbeatable price and in a very short time, by people for whom this is a hobby. We demonstrate here that the situation is far from being so ideal, whether in terms of quality, price, workers' status or ethics. We then recall the alternative solutions that already exist or have been proposed. Our goal is twofold: to inform researchers, so that they can make their choices with full knowledge of the facts, and to propose practical and organizational solutions to improve the development of new language resources while limiting the risks of ethical and legal abuses, without this coming at the expense of their cost or quality.

pdf bib abs
Segmentation et induction de lexique non-supervisées du mandarin (Unsupervised segmentation and induction of mandarin lexicon)
Pierre Magistry | Benoît Sagot
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

For most languages written in the Latin alphabet, splitting a text on spaces and punctuation marks is a good approximation of a segmentation into lexical units. Although this approximation hides many difficulties, they do not compare with those encountered when processing languages which, like Mandarin Chinese, do not use spaces. A large number of segmentation systems have been proposed, some of which adopt a linguistically motivated unsupervised approach. However, commonly used evaluation methods do not account for all the properties of such systems. In this article, we show that a simple model, based on an entropy-based reformulation of a language-independent hypothesis stated by Harris (1955), makes it possible to segment a corpus and extract a lexicon from it. Tested on the Academia Sinica corpus, our system induces a segmentation and a lexicon that have good intrinsic properties and whose characteristics are similar to those of the lexicon underlying the manually segmented corpus. Moreover, we observe a certain correlation between the results of the segmentation model and the syntactic structures provided by a treebanked subpart of the corpus.

pdf bib abs
Coopération de méthodes statistiques et symboliques pour l’adaptation non-supervisée d’un système d’étiquetage en entités nommées (Statistical and symbolic methods cooperation for the unsupervised adaptation of a named entity recognition system)
Frédéric Béchet | Benoît Sagot | Rosa Stern
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

Named entity detection and typing are tasks for which both symbolic and probabilistic systems have been developed. We present the results of an experiment aimed at making two systems interact: the rule-based system NP, developed on AFP corpora, which integrates the Aleda entity database and has good precision, and the LIANE system, trained on speech transcriptions from the ESTER corpus, which has good recall. We show that a probabilistic system such as LIANE can be adapted to a new type of corpus, in an unsupervised way, thanks to large corpora automatically annotated by NP. This adaptation requires no additional manual annotation and illustrates the complementarity of numerical and symbolic methods for solving linguistic tasks.

pdf bib abs
Construction d’un lexique des adjectifs dénominaux (Construction of a lexicon of denominal adjectives)
Jana Strnadová | Benoît Sagot
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

After a brief linguistic analysis of denominal adjectives in French, we describe the automatic process we set up, based on lexicons and large corpora, to build a lexicon of regularly derived denominal adjectives. We estimate both the precision and the coverage of the resulting derivational lexicon. Eventually, this freely available lexicon will have been manually validated and will also contain denominal adjectives with suppletive bases.

pdf bib abs
Développement de ressources pour le persan : PerLex 2, nouveau lexique morphologique et MEltfa, étiqueteur morphosyntaxique (Development of resources for Persian: PerLex 2, a new morphological lexicon and MEltfa, a morphosyntactic tagger)
Benoît Sagot | Géraldine Walther | Pegah Faghiri | Pollet Samvelian
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

We present a new version of PerLex, a morphological lexicon for Persian, a corrected and partially re-annotated version of the BijanKhan tagged corpus (BijanKhan, 2004), and MEltfa, a new freely available part-of-speech tagger for Persian. After developing a first version of PerLex (Sagot & Walther, 2010), we propose here an improved version. In addition to partial manual validation, PerLex 2 now relies on a linguistically motivated inventory of categories. We also developed a new version of the BijanKhan corpus: it contains significant corrections to the tokenization as well as a re-tagging using the new categories. Finally, this new version of the corpus was used to train MEltfa, our freely available part-of-speech tagger for Persian, which relies on this new category inventory, on PerLex 2 and on the MElt tagging system (Denis & Sagot, 2009).

2010

pdf bib abs
A Lexicon of French Quotation Verbs for Automatic Quotation Extraction
Benoît Sagot | Laurence Danlos | Rosa Stern
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Quotation extraction is an important information extraction task, especially when dealing with news wires. Quotations can be found in various configurations. In this paper, we focus on direct quotations introduced by a parenthetical clause, headed by a "quotation verb". Our study is based on a large French news wire corpus from the Agence France-Presse. We introduce and motivate an analysis at the discursive level of such quotations, which differs from the syntactic analyses generally proposed. We show how we enriched the Lefff syntactic lexicon so that it provides an account for quotation verbs heading a quotation parenthetical, especially those extracted from a news wire corpus. We also sketch how these lexical entries can be extended to the discursive level in order to model quotations introduced in a parenthetical clause in a complete way.

pdf bib abs
A Morphological Lexicon for the Persian Language
Benoît Sagot | Géraldine Walther
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We introduce PerLex, a large-coverage and freely-available morphological lexicon for the Persian language. We describe the main features of the Persian morphology, and the way we have represented it within the Alexina formalism, on which PerLex is based. We focus on the methodology we used for constructing lexical entries from various sources, as well as the problems related to typographic normalisation. The resulting lexicon shows a satisfying coverage on a reference corpus and should therefore be a good starting point for developing a syntactic lexicon for the Persian language.

pdf bib abs
The Lefff, a Freely Available and Large-coverage Morphological and Syntactic Lexicon for French
Benoît Sagot
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this paper, we introduce the Lefff, a freely available, accurate and large-coverage morphological and syntactic lexicon for French, used in many NLP tools such as large-coverage parsers. We first describe Alexina, the lexical framework in which the Lefff is developed as well as the linguistic notions and formalisms it is based on. Next, we describe the various sources of lexical data we used for building the Lefff, in particular semi-automatic lexical development techniques and conversion and merging of existing resources. Finally, we illustrate the coverage and precision of the resource by comparing it with other resources and by assessing its impact in various NLP tools.

pdf bib abs
Exploitation d’une ressource lexicale pour la construction d’un étiqueteur morpho-syntaxique état-de-l’art du français
Pascal Denis | Benoît Sagot
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

This article presents MEltfr, an automatic part-of-speech tagger for French. It relies on a sequential probabilistic model that benefits from information drawn from an external lexicon, namely the Lefff. Evaluated on the FTB, MEltfr reaches an accuracy of 97.75% (91.36% on unknown words) with a set of 29 tags. This corresponds to an error rate reduction of 18% (36.1% on unknown words) compared with the same model without coupling with the Lefff. We study the contribution of this resource in more detail through two series of experiments. These show in particular that the contribution of the Lefff-derived features is to provide better coverage, as well as finer modeling of the right context of words.

pdf bib abs
Développement de ressources pour le persan: lexique morphologique et chaîne de traitements de surface
Benoît Sagot | Géraldine Walther
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

We present PerLex, a large-coverage and freely available morphological lexicon for Persian, together with a surface processing chain for this language. We describe some characteristics of Persian morphology and the way we represented it in the Alexina lexical formalism, on which PerLex is based. We focus on the methodology we used to build lexical entries from various sources, as well as on the problems related to typographic normalization. The resulting lexicon has satisfactory coverage on a reference corpus, and should therefore be a good starting point for developing a syntactic lexicon for Persian.

pdf bib abs
Ponctuations fortes abusives
Laurence Danlos | Benoît Sagot
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

Some strong punctuation marks are "wrongly" used in place of weak punctuation marks, resulting in graphic sentences that are not grammatical sentences. This article presents a corpus study of this phenomenon and a draft tool for automatically detecting such misused strong punctuation marks.

pdf bib abs
Traitement des inconnus : une approche systématique de l’incomplétude lexicale
Helena Blancafort | Gaëlle Recourcé | Javier Couto | Benoît Sagot | Rosa Stern | Denis Teyssou
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

This article addresses the incompleteness of lexical resources, i.e., the problem of unknown words, in the context of automatic processing. We first propose an operational definition of the notion of unknown word. We then describe a typology of the different classes of unknown words, motivated by linguistic and application-oriented considerations as well as by the annotation of the unknown words of a small corpus according to our typology. This typology will be implemented and validated through the annotation of a large corpus from the Agence France-Presse as part of the EDyLex project.

pdf bib abs
Détection et résolution d’entités nommées dans des dépêches d’agence
Rosa Stern | Benoît Sagot
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

We present NP, a named entity recognition system. It includes a resolution module, which associates each entity mention with the referent it denotes among the entries of a dedicated entity repository. NP thus provides information relevant to using named entity extraction in an applied context. The system is evaluated using a manually annotated corpus developed for this purpose and suited to the detection and resolution tasks.

pdf bib
Optimal Rank Reduction for Linear Context-Free Rewriting Systems with Fan-Out Two
Benoît Sagot | Giorgio Satta
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

pdf bib
Influence of Pre-Annotation on POS-Tagged Corpus Development
Karën Fort | Benoît Sagot
Proceedings of the Fourth Linguistic Annotation Workshop

pdf bib
Control Verb, Argument Cluster Coordination and Multi Component TAG
Djamé Seddah | Benoit Sagot | Laurence Danlos
Proceedings of the 10th International Workshop on Tree Adjoining Grammar and Related Frameworks (TAG+10)

2009

pdf bib
MICA: A Probabilistic Dependency Parser Based on Tree Insertion Grammars (Application Note)
Srinivas Bangalore | Pierre Boullier | Alexis Nasr | Owen Rambow | Benoît Sagot
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers

pdf bib abs
Trouver et confondre les coupables : un processus sophistiqué de correction de lexique
Lionel Nicolas | Benoît Sagot | Miguel A. Molinero | Jacques Farré | Éric Villemonte De La Clergerie
Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

The coverage of a parser depends above all on the grammar and the lexicon it relies on. Developing a complete and accurate lexicon is an arduous and long-term task, especially once the lexicon reaches a certain level of quality and coverage. In this article, we present a process that can automatically detect missing or incomplete entries in a lexicon and suggest corrections for these entries. Detection is carried out by means of two techniques, relying either on a statistical model or on information provided by a syntactic tagger. Correction hypotheses for the detected lexical entries are generated by studying the modifications that improve the parsing rate of the sentences in which these entries occur. The overall process brings together several techniques using various tools such as taggers, parsers and entropy classifiers. Its application to the Lefff, a large-coverage morphological and syntactic lexicon for French, has already allowed us to make notable improvements.

pdf bib abs
Intégrer les tables du Lexique-Grammaire à un analyseur syntaxique robuste à grande échelle
Benoît Sagot | Elsa Tolone
Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

In this article, we show how we converted the Lexique-Grammaire tables into an NLP format, that of the Lefff lexicon, thus allowing their integration into the FRMG parser. We present the linguistic foundations of this conversion process and the resulting lexicon. We validate the resulting lexicon by evaluating the FRMG parser on the reference corpus of the EASy campaign, depending on whether it uses the verbal entries of the Lefff or those of the Lexique-Grammaire verb tables thus converted.

pdf bib
Coupling an Annotated Corpus and a Morphosyntactic Lexicon for State-of-the-Art POS Tagging with Less Human Effort
Pascal Denis | Benoît Sagot
Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, Volume 1

pdf bib
Constructing parse forests that include exactly the n-best PCFG trees
Pierre Boullier | Alexis Nasr | Benoît Sagot
Proceedings of the 11th International Conference on Parsing Technologies (IWPT’09)

pdf bib
Parsing Directed Acyclic Graphs with Range Concatenation Grammars
Pierre Boullier | Benoît Sagot
Proceedings of the 11th International Conference on Parsing Technologies (IWPT’09)

pdf bib
Building a morphological and syntactic lexicon by merging various linguistic resources
Miguel A. Molinero | Benoît Sagot | Lionel Nicolas
Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009)

pdf bib
A Morphological and Syntactic Wide-coverage Lexicon for Spanish: The Leffe
Miguel A. Molinero | Benoît Sagot | Lionel Nicolas
Proceedings of the International Conference RANLP-2009

2008

pdf bib
Computer Aided Correction and Extension of a Syntactic Wide-Coverage Lexicon
Lionel Nicolas | Benoît Sagot | Miguel A. Molinero | Jacques Farré | Éric de la Clergerie
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

pdf bib abs
Construction d’un wordnet libre du français à partir de ressources multilingues
Benoît Sagot | Darja Fišer
Actes de la 15ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

This article describes the construction of a free wordnet for French (WOLF, Wordnet Libre du Français) from the Princeton WordNet and various multilingual resources. Polysemous lexemes were handled by means of an approach based on the word alignment of a parallel corpus in five languages. The extracted multilingual lexicon was semantically disambiguated using the wordnets of the languages concerned. In addition, a bilingual approach was sufficient to build new entries from monosemous lexemes. To this end, we extracted bilingual lexicons from Wikipedia and thesauri. The resulting wordnet was evaluated against the French wordnet from the EuroWordNet project. The results are encouraging, and applications are already being considered.

2007

pdf bib abs
Comparaison du Lexique-Grammaire des verbes pleins et de DICOVALENCE : vers une intégration dans le Lefff
Laurence Danlos | Benoît Sagot
Actes de la 14ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

This paper compares the Lexique-Grammaire of full verbs and DICOVALENCE, two syntactic lexical resources for French that linguists have been developing for many years. We study in particular the divergences and overlaps between the underlying lexical models. We then present the Lefff, a large-scale syntactic lexicon for NLP, and its own lexical model. We show that this model is able to integrate the lexical information found in the Lexique-Grammaire and in DICOVALENCE. We present the results of the first work carried out in this direction, with the long-term goal of building a reference syntactic lexicon for NLP.

pdf bib
Are Very Large Context-Free Grammars Tractable?
Pierre Boullier | Benoît Sagot
Proceedings of the Tenth International Conference on Parsing Technologies

2006

pdf bib abs
Deep non-probabilistic parsing of large corpora
Benoît Sagot | Pierre Boullier
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper reports a large-scale non-probabilistic parsing experiment with a deep LFG parser. We briefly introduce the parser we used, named SXLFG, and the resources that were used together with it. Then we report quantitative results about the parsing of a multi-million word journalistic corpus. We show that we can parse more than 6 million words in less than 12 hours, only 6.7% of all sentences reaching the 1s timeout. This shows that deep large-coverage non-probabilistic parsers can be efficient enough to parse very large corpora in a reasonable amount of time.

pdf bib abs
The Lefff 2 syntactic lexicon for French: architecture, acquisition, use
Benoît Sagot | Lionel Clément | Éric Villemonte de La Clergerie | Pierre Boullier
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this paper, we introduce a new lexical resource for French which is freely available as the second version of the Lefff (Lexique des formes fléchies du français - Lexicon of French inflected forms). It is a wide-coverage morphosyntactic and syntactic lexicon whose architecture relies on property inheritance, which makes it more compact and easier to maintain, and allows lexical entries to be described independently of the formalisms it is used with. For these two reasons, we define it as a meta-lexicon. We describe its architecture, several automatic or semi-automatic approaches we use to acquire, correct and/or enrich such a lexicon, as well as the way it is used both with an LFG parser and with a TAG parser based on a meta-grammar, so as to build two large-coverage parsers for French. The web site of the Lefff is http://www.lefff.net/.

pdf bib
Modeling and Analysis of Elliptic Coordination by Dynamic Exploitation of Derivation Forests in LTAG Parsing
Djamé Seddah | Benoît Sagot
Proceedings of the Eighth International Workshop on Tree Adjoining Grammar and Related Formalisms

pdf bib abs
Trouver le coupable : Fouille d’erreurs sur des sorties d’analyseurs syntaxiques
Benoît Sagot | Éric Villemonte De La Clergerie
Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

We present an error mining method for automatically detecting errors in the resources used by parsing systems. We applied this method to the output of parsing several million words with two different parsing systems that nevertheless share the same syntactic lexicon and pre-syntactic processing chain. We were thus able to identify inaccuracies and gaps in the resources used. In particular, comparing the results obtained on the outputs of the two parsers on the same corpus allowed us to separate problems arising from the shared resources from those arising from the grammars.

pdf bib abs
Modélisation et analyse des coordinations elliptiques par l’exploitation dynamique des forêts de dérivation
Djamé Seddah | Benoît Sagot
Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. Posters

In this paper we present a general approach to the modelling and parsing of elliptical coordinations. We show that elided lexemes can be replaced, during parsing, by information coming from the other conjunct of the coordination, which is used as a guide at the derivation level. Furthermore, we show how this approach can be implemented in practice through a slight extension of Lexicalized Tree Adjoining Grammars (LTAG) via an operation called fusion. We describe the derivation algorithms needed to parse coordinated constructions that may contain any number of ellipses.

pdf bib
Error Mining in Parsing Results
Benoît Sagot | Éric de la Clergerie
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

2005

pdf bib
Efficient and Robust LFG Parsing: SxLFG
Pierre Boullier | Benoît Sagot
Proceedings of the Ninth International Workshop on Parsing Technology

pdf bib abs
Chaînes de traitement syntaxique
Pierre Boullier | Lionel Clément | Benoît Sagot | Éric Villemonte De La Clergerie
Actes de la 12ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

This paper presents the full set of tools we deployed for the EASy parsing evaluation campaign. We begin with an overview of the morphological and syntactic lexicon used. We then briefly describe the properties of our pre-syntactic processing chain, which makes it possible to handle unrestricted corpora. Next we present the two parsing systems we used, a TAG parser derived from a meta-grammar and an LFG parser. We compare these two systems, pointing out what they have in common, such as intensive use of computation sharing and compact representations of information, as well as their differences in terms of formalisms, grammars and parsers. We then describe the post-processing step, which allowed us to extract from our analyses the information required by the EASy campaign. We conclude with a quantitative evaluation of our architectures.

pdf bib abs
Un analyseur LFG efficace pour le français : SXLFG
Pierre Boullier | Benoît Sagot | Lionel Clément
Actes de la 12ème conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

In this paper, we propose a new parser based on a variant of the Lexical-Functional Grammar (LFG) model. This LFG parser takes a word lattice as input and computes its functional structures on a shared forest. We also present the various error recovery techniques we implemented. We then evaluate this parser with a wide-coverage grammar of French in the context of large-scale use on varied corpora. We show that this parser is both efficient and robust.

pdf bib abs
Les Méta-RCG: description et mise en oeuvre
Benoît Sagot
Actes de la 12ème conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

In this paper we present a new linguistic formalism based on Range Concatenation Grammars (RCG), called Méta-RCG. We first explain why non-linearity allows an adequate representation of linguistic phenomena, and in particular of the interaction between the different levels of description. We then present Méta-RCGs and the additional linguistic concepts they implement, while remaining convertible into classical RCGs. We show that classical analyses (constituents, dependencies, topology, predicate-argument semantics) can be obtained by partial projection of a complete Méta-RCG analysis. Finally, we describe the grammar of French that we are developing in this new formalism and the efficient parser derived from it. We then illustrate the notion of partial projection with an example.

2004

pdf bib abs
Les Grammaires à Concaténation d’Intervalles (RCG) comme formalisme grammatical pour la linguistique
Benoît Sagot | Pierre Boullier
Actes de la 11ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

The aim of this paper is to show why Range Concatenation Grammars (RCG) are a formalism particularly well suited to the description of natural language. We first explain that the power needed to describe natural language is that of PTIME. Then, among the grammatical formalisms with this expressive power, we justify the choice of RCGs. Finally, after an overview of their definition and properties, we show how using them as linguistic grammars makes it possible to handle complex syntagmatic phenomena, to perform parsing and the checking of various constraints (morphosyntactic, lexical-semantic) simultaneously, and to dynamically build modular linguistic grammars.

pdf bib
Morphology Based Automatic Acquisition of Large-coverage Lexica
Lionel Clément | Benoît Sagot | Bernard Lang
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)