Pascal Amsili

2024

FReND: A French Resource of Negation Data
Hafida Le Cloirec - Ait Yahya | Olga Seminck | Pascal Amsili
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

FReND is a freely available corpus of French in which negations are hand-annotated with their cues and scopes. Comprising 590K tokens and over 8.9K negations, it is the largest such dataset available for French. It covers a variety of textual genres: literature, blog posts, Wikipedia articles, political debates, clinical reports and newspaper articles. As current state-of-the-art AI models have not yet mastered the understanding of negation, FReND is valuable not only as a resource for linguistic research into negation, but also as training data for AI tasks such as negation detection.

2023

Uniformité de la densité informationnelle: le cas du redoublement du sujet (Uniform information density: the case of subject doubling)
Yiming Liang | Pascal Amsili | Heather Burnett
Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : travaux de recherche originaux -- articles longs

We present the results of an experiment designed to determine whether information density (or surprisal) affects subject doubling in spontaneous conversation. Using the French version of GPT, we estimate the lexical surprisal of the NP subject given the preceding context and test whether the subject's surprisal affects its doubling. A mixed-effects regression analysis shows that, in addition to the factors that the literature has shown to affect subject doubling, the predictability of the nominal subject is an important predictor of non-doubling. Less predictable nominal subjects tend to be doubled more often than more predictable ones. Our work confirms the relevance of the Uniform Information Density (UID) hypothesis for French and illustrates how information density can be operationalized with large neural language models.
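For readers unfamiliar with the quantity involved, surprisal is conventionally defined as the negative log probability of a word given its context. The following is a minimal sketch of that definition only; the probabilities are invented for illustration and do not come from the paper or from any language model.

```python
import math


def surprisal(prob: float) -> float:
    """Return the surprisal, in bits, of an event with probability `prob`."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)


# Hypothetical next-word probabilities (assumed values, not model output).
toy_probs = {"Marie": 0.25, "voisin": 0.03125}

# A less predictable word receives a higher surprisal.
toy_surprisal = {word: surprisal(p) for word, p in toy_probs.items()}
```

In a real setting the probabilities would be read off a language model's distribution over the subject's tokens given the preceding context.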

The Self-Contained Negation Test Set
David Kletz | Pascal Amsili | Marie Candito
Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP

Several methodologies have recently been proposed to evaluate the ability of Pretrained Language Models (PLMs) to interpret negation. In this article, we build on Gubelmann and Handschuh (2022), which studies how PLMs’ predictions change as a function of the polarity of inputs, in English. Crucially, this test uses “self-contained” inputs ending with a masked position: depending on the polarity of a verb in the input, a particular token is either semantically ruled out or allowed at the masked position. By replicating the experiments of Gubelmann and Handschuh (2022), we have uncovered flaws that weaken the conclusions that can be drawn from this test. We thus propose an improved version, the Self-Contained Neg Test, which is more controlled, more systematic, and entirely based on examples forming minimal pairs varying only in the presence or absence of verbal negation in English. When applying our test to the roberta and bert base and large models, we show that only roberta-large shows trends that match the expectations, while bert-base is mostly insensitive to negation. For all the tested models, though, in a significant number of test instances the top-1 prediction remains the token that is semantically forbidden by the context, which shows how much room for improvement remains for a proper treatment of negation.
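The last observation, that the top-1 prediction is often the semantically forbidden token, can be illustrated with a small sketch. Everything below is an assumption made for illustration: `forbidden_top1_rate`, the stand-in predictor `toy_predict`, and the two example inputs are hypothetical and are not taken from the test set or from any model.

```python
from typing import Callable, Dict, List, Tuple

# A predictor maps an input ending in a masked position to token scores.
Predictor = Callable[[str], Dict[str, float]]


def forbidden_top1_rate(predict: Predictor, items: List[Tuple[str, str]]) -> float:
    """Share of items whose top-1 prediction is the forbidden token."""
    if not items:
        raise ValueError("items must not be empty")
    hits = 0
    for text, forbidden in items:
        scores = predict(text)
        top1 = max(scores, key=scores.get)  # highest-scoring candidate
        if top1 == forbidden:
            hits += 1
    return hits / len(items)


def toy_predict(text: str) -> Dict[str, float]:
    """Negation-blind stand-in for a masked language model (hypothetical)."""
    return {"car": 0.7, "bike": 0.3}


# Invented negative-polarity inputs paired with the token they rule out.
toy_items = [
    ("Jo did not buy a car. Jo now owns a [MASK].", "car"),
    ("Sam did not buy a bike. Sam now owns a [MASK].", "bike"),
]

rate = forbidden_top1_rate(toy_predict, toy_items)
```

With a real PLM, `predict` would wrap the model's distribution at the masked position, and the rate would be compared between the negative and affirmative member of each minimal pair.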

Probing structural constraints of negation in Pretrained Language Models
David Kletz | Marie Candito | Pascal Amsili
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

Contradictory results about the encoding of the semantic impact of negation in pretrained language models (PLMs) have been reported recently (e.g. Kassner and Schütze (2020); Gubelmann and Handschuh (2022)). In this paper we focus instead on the way PLMs encode negation and its formal impact, through the phenomenon of Negative Polarity Item (NPI) licensing in English. More precisely, we use probes to identify which contextual representations best encode 1) the presence of negation in a sentence, and 2) the polarity of a neighboring masked polarity item. We find that contextual representations of tokens inside the negation scope do allow for (i) a better prediction of the presence of “not” compared to those outside the scope and (ii) a better prediction of the right polarity of a masked polarity item licensed by “not”, although the magnitude of the difference varies from PLM to PLM. Importantly, in both cases the trend holds even when controlling for distance to “not”. This tends to indicate that the embeddings of these models do reflect the notion of negation scope, and do encode the impact of negation on NPI licensing. Yet further control experiments reveal that the presence of other lexical items is also better captured when using the contextual representation of a token within the same syntactic clause than outside it, suggesting that PLMs simply capture the more general notion of syntactic clause.

2022

Investigating associative, switchable and negatable Winograd items on renewed French data sets
Xiaoou Wang | Olga Seminck | Pascal Amsili
Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

The Winograd Schema Challenge (WSC) consists of a set of anaphora resolution problems resolvable only by reasoning about world knowledge. This article describes the update of the existing French data set and the creation of three subsets allowing for a more robust, fine-grained evaluation protocol of WSC in French (FWSC): an associative subset (items easily resolvable with lexical co-occurrence), a switchable subset (items where the inversion of two keywords reverses the answer) and a negatable subset (items where applying negation to the verb reverses the answer). Experiments on these data sets with CamemBERT reach state-of-the-art performance. Our evaluation protocol also showed that the higher performance could be explained by the existence of associative items in FWSC. In addition, increasing the size of the training corpus improves the model’s performance on switchable items, while the impact of a larger training corpus remains small on negatable items.
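A fine-grained protocol of this kind amounts to reporting accuracy separately for each subset rather than a single global score. The sketch below shows one simple way to do that; the function name `accuracy_by_subset` and the outcome list are made up for illustration and are not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_subset(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute accuracy per subset from (subset_name, is_correct) pairs."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for subset, is_correct in results:
        totals[subset] += 1
        correct[subset] += int(is_correct)
    return {subset: correct[subset] / totals[subset] for subset in totals}


# Invented outcomes, one pair per evaluated item (not real data).
toy_results = [
    ("associative", True),
    ("associative", True),
    ("switchable", True),
    ("switchable", False),
    ("negatable", False),
    ("negatable", False),
]

scores = accuracy_by_subset(toy_results)
```

Reporting the subsets side by side makes it visible when a high overall score is driven mainly by the associative items.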

2021

Inter-clausal Anaphora in Chinese Conditionals: a Multi-factorial Analysis
Shunting Chen | Pascal Amsili | Yiming Liang
Proceedings of the 35th Pacific Asia Conference on Language, Information and Computation

2020

VerNom : une base de paires morphologiques acquise sur très gros corpus (VerNom : a French derivational database acquired on a massive corpus)
Alice Missud | Pascal Amsili | Florence Villoing
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 2 : Traitement Automatique des Langues Naturelles

While an active line of research in derivational morphology focuses on the competition between the suffixations that form event nouns from verbs (-age, -ment, -ion, -ure, -ance, -ade, -aison), access to large quantities of data has become necessary for applying quantitative methods. To gather pairs of morphologically related verbs and nouns formed by these rival suffixations, we present VerNom, a morphological database of 25,857 verb-noun pairs, built automatically from a massive web corpus.

2019

Modèles de langue appliqués aux schémas Winograd français (Language Models applied to French Winograd Schemas)
Olga Seminck | Vincent Segonne | Pascal Amsili
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume II : Articles courts

Winograd schemas are anaphora resolution problems designed to require reasoning about world knowledge. By construction, they are insensitive to simple statistics (corpus co-occurrences). Yet today's state-of-the-art systems for English rely on language models to resolve the schemas (Trinh & Le, 2018). In this article, we present a study that tests similar models on French schemas. This leads us to revisit the evaluation metrics used in the community for Winograd schemas. The performance we obtain, especially compared with that of Amsili & Seminck (2017b), suggests that the language-model approach to Winograd schemas remains limited, probably in part because language models have great difficulty encoding the kind of reasoning needed to resolve them.

Résolution des coréférences neuronale : une approche basée sur les têtes (Neural coreference resolution : a head-based approach)
Quentin Gliosca | Pascal Amsili
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume II : Articles courts

The advent of end-to-end neural approaches has caused a break in the way the coreference resolution task had been conceived and implemented until now. We believe that this break calls into question the conception of mentions as maximal phrases, at least for some applications, of which we give two examples. With this in mind, we propose a new, head-based formulation of the task, together with an adaptation of the model of Lee et al. (2017) that implements it.

2018

A Gold Anaphora Annotation Layer on an Eye Movement Corpus
Olga Seminck | Pascal Amsili
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

A Computational Model of Human Preferences for Pronoun Resolution
Olga Seminck | Pascal Amsili
Proceedings of the Student Research Workshop at the 15th Conference of the European Chapter of the Association for Computational Linguistics

We present a cognitive computational model of pronoun resolution that reproduces the human interpretation preferences of the Subject Assignment Strategy and the Parallel Function Strategy. Our model relies on a probabilistic pronoun resolution system trained on corpus data. Factors influencing pronoun resolution are represented as features weighted by their relative importance. The importance the model gives to the preferences is in line with psycholinguistic studies. We demonstrate the cognitive plausibility of the model by running it on experimental items and simulating antecedent choice and reading times of human participants. Our model can be used as a new means to study pronoun resolution, because it captures the interaction of preferences.

A Google-Proof Collection of French Winograd Schemas
Pascal Amsili | Olga Seminck
Proceedings of the 2nd Workshop on Coreference Resolution Beyond OntoNotes (CORBON 2017)

This article presents the first collection of French Winograd Schemas. Winograd Schemas form anaphora resolution problems that can only be resolved with extensive world knowledge. For this reason the Winograd Schema Challenge has been proposed as an alternative to the Turing Test. A very important feature of Winograd Schemas is that it should be impossible to resolve them with statistical information about word co-occurrences: they should be Google-proof. We propose a measure of Google-proofness based on Mutual Information, and demonstrate the method on our collection of French Winograd Schemas.
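To give an idea of what a co-occurrence-based check can look like, here is a minimal sketch using pointwise mutual information (PMI) computed from raw counts. It is an assumption-laden illustration: the abstract does not spell out the exact measure, and the function name `pmi` and all counts below are hypothetical.

```python
import math


def pmi(count_xy: int, count_x: int, count_y: int, total: int) -> float:
    """Pointwise mutual information, in bits, estimated from raw counts."""
    if min(count_xy, count_x, count_y, total) <= 0:
        raise ValueError("all counts must be positive")
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))


# Invented counts: how often a keyword co-occurs with each candidate antecedent.
pmi_candidate_a = pmi(count_xy=16, count_x=16, count_y=32, total=64)
pmi_candidate_b = pmi(count_xy=8, count_x=16, count_y=32, total=64)

# A large gap means co-occurrence statistics alone favor one candidate,
# which is what a Google-proof schema should avoid.
association_gap = abs(pmi_candidate_a - pmi_candidate_b)
```

Under this toy reading, a schema would count as more Google-proof when the gap between its candidate antecedents is close to zero.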

Schémas Winograd en français: une étude statistique et comportementale (Winograd schemas in French : a statistical and behavioral study)
Pascal Amsili | Olga Seminck
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 2 - Articles courts

In this article, we present a collection of Winograd schemas in French, adapted from the list proposed by Levesque et al. (2012) for English. Winograd schemas are anaphora resolution problems designed to be AI-complete. We show that our collection satisfies two crucial properties: it is robust to simple statistical methods (“Google-proof”), while being largely unambiguous for the human subjects we tested.

2014

The Asfalda project aims to develop a French corpus with frame-based semantic annotations and automatic tools for shallow semantic analysis. We present the first part of the project: focusing on a set of notional domains, we delimited a subset of English frames, adapted them to French data when necessary, and developed the corresponding French lexicon. We believe that working domain by domain helped us to enforce the coherence of the resulting resource, and also has the advantage that, though the number of frames is limited (around a hundred), we obtain full coverage within a given domain.

Learning simulation of nominal/verbal contexts through n-grams (Simulation de l’apprentissage des contextes nominaux/verbaux par n-grammes) [in French]
Perrine Brusini | Pascal Amsili | Emmanuel Chemla | Anne Christophe
Proceedings of TALN 2014 (Volume 2: Short Papers)

2011

French TimeBank : un corpus de référence sur la temporalité en français (French TimeBank: a reference corpus on temporality in French)
André Bittar | Pascal Amsili | Pascal Denis
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

This article has a twofold objective: first, to present to the community a recently released corpus, the French Time Bank (FTiB), a collection of journalistic texts annotated for times and events according to the ISO-TimeML standard; second, to share the results and methodological reflections we drew from building this reference corpus, in the hope that our experience may prove useful beyond the community interested in temporal processing.

French TimeBank: An ISO-TimeML Annotated Reference Corpus
André Bittar | Pascal Amsili | Pascal Denis | Laurence Danlos
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

2002

Discours et compositionnalité (Discourse and compositionality)
Laurent Roussarie | Pascal Amsili
Actes de la 9ème conférence sur le Traitement Automatique des Langues Naturelles. Posters

Starting from the principle that some sentences can perform several speech acts, i.e., at the semantics-pragmatics interface, several separate discourse constituents, we propose, within the framework of SDRT, an algorithm for constructing semantic representations that takes all discursive aspects into account as early as possible and in a compositional way.