2024
NuNER: Entity Recognition Encoder Pre-training via LLM-Annotated Data
Sergei Bogdanov | Alexandre Constantin | Timothée Bernard | Benoit Crabbé | Etienne P Bernard
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Large Language Models (LLMs) have shown impressive abilities in data annotation, opening the way for new approaches to solve classic NLP problems. In this paper, we show how to use LLMs to create NuNER, a compact language representation model specialized in the Named Entity Recognition (NER) task. NuNER can be fine-tuned to solve downstream NER problems in a data-efficient way, outperforming similar-sized foundation models in the few-shot regime and competing with much larger LLMs. We find that the size and entity-type diversity of the pre-training dataset are key to achieving good performance. We view NuNER as a member of the broader family of task-specific foundation models, recently unlocked by LLMs. NuNER and NuNER’s dataset are open-sourced under the MIT License.
Auto-correction et oracle dynamique : certains effets n’apparaissent qu’à taille réduite
Fang Zhao | Timothée Bernard
Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 1 : articles longs et prises de position
We study the effect of self-correction capability, of the use of a dynamic oracle, and of model size on the performance of a joint (morpho)syntactic/semantic parser. We show that with a small model, the possibility of self-correction is harmful for semantics but beneficial for syntax, while the use of a dynamic oracle improves semantic performance. We also observe that these effects are often attenuated for larger models.
The Emergence of High-Level Semantics in a Signaling Game
Timothée Bernard | Timothee Mickus | Hiroya Takamura
Proceedings of the 13th Joint Conference on Lexical and Computational Semantics (*SEM 2024)
The symbol grounding problem—how to connect a symbolic system to the outer world—is a longstanding question in AI that has recently gained prominence with the progress made in NLP in general and surrounding large language models in particular. In this article, we study the emergence of semantic categories in the communication protocol developed by neural agents involved in a well-established type of signaling game. In its basic form, the game requires one agent to retrieve an image based on a message produced by a second agent. We first show that the agents are able to, and do, learn to communicate high-level semantic concepts rather than low-level features of the images even from very indirect training signal to that end. Second, we demonstrate that the introduction of an adversarial agent in the game fosters the emergence of semantics by producing an appropriate training signal when no other method is available.
Improving Word Sense Induction through Adversarial Forgetting of Morphosyntactic Information
Deniz Ekin Yavas | Timothée Bernard | Laura Kallmeyer | Benoît Crabbé
Proceedings of the 13th Joint Conference on Lexical and Computational Semantics (*SEM 2024)
This paper addresses the problem of word sense induction (WSI) via clustering of word embeddings. It starts from the hypothesis that contextualized word representations obtained from pre-trained language models (LMs), while being a valuable source for WSI, encode more information than is necessary for the identification of word senses, and that some of this information affects performance negatively in unsupervised settings. We investigate whether using contextualized representations that are invariant to these ‘nuisance features’ can increase WSI performance. For this purpose, we propose an adaptation of the adversarial training framework proposed by Jaiswal et al. (2020) to erase specific information from the representations of LMs, thereby creating feature-invariant representations. We experiment with erasing (i) morphological and (ii) syntactic features. The results of subsequent clustering for WSI show that these features indeed act like noise: using feature-invariant representations, compared to using the original representations, increases clustering-based WSI performance. Furthermore, we provide an in-depth analysis of how the information about the syntactic and morphological features of words relates to and affects WSI performance.
2023
So many design choices: Improving and interpreting neural agent communication in signaling games
Timothée Bernard | Timothee Mickus
Findings of the Association for Computational Linguistics: ACL 2023
Emergent language games are experimental protocols designed to model how communication may arise among a group of agents. In this paper, we focus on how to improve the performance of neural agents playing a signaling game: a sender is exposed to an image and generates a sequence of symbols that is transmitted to a receiver, which uses it to distinguish between two images, one that is semantically related to the original image and one that is not. We consider multiple design choices, such as pretraining the visual components of the agents, introducing regularization terms, and how to sample training items from the dataset, and we study how these different choices impact the behavior and performance of the agents. To that end, we introduce a number of automated metrics to measure the properties of the emergent language. We find that some implementation choices are always beneficial, and that the information conveyed by the agents’ messages is shaped not only by the game, but also by the overall design of the agents as well as seemingly unrelated implementation choices.
Auto-apprentissage et renforcement pour une analyse jointe sur données disjointes : étiquetage morpho-syntaxique et analyse syntaxique
Fang Zhao | Timothée Bernard
Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 2 : travaux de recherche originaux -- articles courts
This article examines the use of disjoint data to train a joint natural language analysis system. In this exploratory study, we train a system to predict part-of-speech tags and a syntactic dependency analysis from sentences annotated for only one of these two tasks. Two methods are considered: self-training and reinforcement learning, for which we define a reward function that encourages the system to make predictions even without supervision. Our results indicate good performance when the disjoint data come from the same domain, but are less satisfactory otherwise. We identify limitations of our current implementation and accordingly propose directions for improvement.
2021
Multiple Tasks Integration: Tagging, Syntactic and Semantic Parsing as a Single Task
Timothée Bernard
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume
Departing from both sequential pipelines and monotask systems, we propose Multiple Tasks Integration (MTI), a multitask paradigm orthogonal to weight sharing. The essence of MTI is to process the input iteratively but concurrently at multiple levels of analysis, where each decision is based on all of the structures that are already inferred and free from usual ordering constraints. We illustrate MTI with a system that performs part-of-speech tagging, syntactic dependency parsing and semantic dependency parsing. We observe that both the use of reinforcement learning and the release from sequential constraints are beneficial to the quality of the syntactic and semantic parses. We also observe that our model adopts an easy-first strategy that consists, on average, of predicting shorter dependencies before longer ones, but that syntax is not always tackled before semantics.
Intégration de tâches: étiquetage morpho-syntaxique, analyse syntaxique et analyse sémantique traités comme une tâche unique (Multiple Tasks Integration: Tagging, Syntactic and Semantic Parsing as a Single Task )
Timothée Bernard
Actes de la 28e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale
We present French and English summaries of the article (Bernard, 2021), presented at the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2021). The article describes Multiple Tasks Integration, a set of principles orthogonal to parameter sharing whose goal is to maximize the interaction between different tasks. Multiple Tasks Integration is illustrated with a system that jointly analyzes the morphosyntactic, syntactic and semantic levels. The strategy adopted by this system, trained with reinforcement learning, is also analyzed.
Tabouid: un jeu de langage et de culture générale généré à partir de Wikipédia (Tabouid: a Wikipedia-based word guessing game)
Timothée Bernard
Actes de la 28e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale
We present French and English summaries of the article (Bernard, 2020), presented at the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020). The article details how a range of relatively simple NLP and machine-learning techniques can be combined to generate, from Wikipedia, the content of a language and general-knowledge game. The article can be seen as defining a stimulating project for NLP students, and the game itself has indeed been implemented as Tabouid, an Android and iOS application.
2020
What Meaning-Form Correlation Has to Compose With: A Study of MFC on Artificial and Natural Language
Timothee Mickus | Timothée Bernard | Denis Paperno
Proceedings of the 28th International Conference on Computational Linguistics
Compositionality is a widely discussed property of natural languages, although its exact definition has been elusive. We focus on the proposal that compositionality can be assessed by measuring meaning-form correlation (MFC). We analyze meaning-form correlation on three sets of languages: (i) artificial toy languages tailored to be compositional, (ii) a set of English dictionary definitions, and (iii) a set of English sentences drawn from literature. We find that linguistic phenomena such as synonymy and ungrounded stop-words weigh on MFC measurements, and that straightforward methods to mitigate their effects have widely varying results depending on the dataset they are applied to. Data and code are made publicly available.
Mandarinograd: A Chinese Collection of Winograd Schemas
Timothée Bernard | Ting Han
Proceedings of the Twelfth Language Resources and Evaluation Conference
This article introduces Mandarinograd, a corpus of Winograd Schemas in Mandarin Chinese. Winograd Schemas are particularly challenging anaphora resolution problems, designed to involve common sense reasoning and to limit the biases and artefacts commonly found in natural language understanding datasets. Mandarinograd contains the schemas in their traditional form, but also as natural language inference instances (ENTAILMENT or NO ENTAILMENT pairs) as well as in their fully disambiguated candidate forms. These two alternative representations are often used by modern solvers, but existing datasets present automatically converted items that sometimes contain syntactic or semantic anomalies. We detail the difficulties faced when building this corpus and explain how we avoided the anomalies just mentioned. We also show that Mandarinograd is resistant to a statistical method based on a measure of word association.
Tabouid: a Wikipedia-based word guessing game
Timothée Bernard
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations
We present Tabouid, a word-guessing game automatically generated from Wikipedia. Tabouid contains 10,000 (virtual) cards in English, and as many in French, covering not only words and linguistic expressions but also a variety of topics including artists, historical events or scientific concepts. Each card corresponds to a Wikipedia article, and conversely, any article could be turned into a card. A range of relatively simple NLP and machine-learning techniques are effectively integrated into a two-stage process. First, a large subset of Wikipedia articles is scored; this score estimates the difficulty, or alternatively, the playability of the page. Then, the best articles are turned into cards by selecting, for each of them, a list of banned words based on its content. We believe that the game we present is more than mere entertainment and that, furthermore, this paper has pedagogical potential.
2018
Fine-Grained Discourse Structures in Continuation Semantics
Timothée Bernard
Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue
In this work, we are interested in the computation of logical representations of discourse. We argue that all discourse connectives are anaphors obeying different sets of constraints and show how this view allows one to account for the semantically parenthetical use of attitude verbs and verbs of report (e.g., think, say) and for sequences of conjunctions (A CONJ_1 B CONJ_2 C). We implement this proposal in event semantics using de Groote (2006)’s dynamic framework.
2017
Une interprétation probabiliste des informations de factivité (Factuality information as sets of probabilities)
Timothée Bernard
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 2 - Articles courts
We present a new formalization of factuality, the dimension representing the degree of belief that a source (the author or any other agent mentioned in a text) attributes to a given eventuality. We emphasize the dynamic aspect of this notion as well as its interactions with discourse structure. We show how an interpretation in terms of sets of probabilities overcomes the main problems that the formalization used in previous work posed for the computation of a factuality that is coherent at the scale of the whole text.
2016
Conjonctions de subordination, verbes de dire et d’attitude propositionnelle : une modélisation STAG pour le discours (Modelling Subordinate Conjunctions, Attitude Verbs and Reporting Verbs in STAG: a Discourse Perspective)
Timothée Bernard
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 3 : RECITAL
We propose a new syntax/semantics model in synchronous tree-adjoining grammar (STAG) for subordinate conjunctions (ConjSub) and verbs of report and propositional attitude (VAP; say, think, believe, etc.). This model, richer than traditional ones, is designed for discourse analysis and is based on the observation that these two categories are far from homogeneous. Indeed, previous work has shown, on the one hand, that occurrences of ConjSub can be divided into two classes with different syntactic and semantic properties and, on the other hand, that VAP exhibit two distinct uses in discourse: evidential and intentional. Our proposal therefore aims to account precisely for these differences while modelling the interactions between VAP and ConjSub.
Modelling Discourse in STAG: Subordinate Conjunctions and Attributing Phrases
Timothée Bernard | Laurence Danlos
Proceedings of the 12th International Workshop on Tree Adjoining Grammars and Related Formalisms (TAG+12)