Caio Corro

2024

pdf abs
Régression logistique parcimonieuse pour l’extraction automatique de règles de grammaire
Santiago Herrera | Caio Corro | Sylvain Kahane
Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 1 : articles longs et prises de position

Nous proposons une nouvelle approche pour extraire et explorer des motifs grammaticaux à partir de corpus arborés, dans le but de construire des règles de grammaire syntaxique. Plus précisément, nous nous intéressons à deux phénomènes linguistiques, l’accord et l’ordre des mots, en utilisant un espace de recherche étendu et en accordant une attention particulière au classement des règles. Pour cela, nous utilisons un classifieur linéaire entraîné avec une pénalisation L1 pour identifier les caractéristiques les plus saillantes. Nous associons ensuite des informations quantitatives à chaque règle. Notre méthode permet de découvrir des règles de différentes granularités, certaines connues et d’autres moins. Dans ce travail, nous nous intéressons aux règles issues d’un corpus du français.

pdf abs
Sparse Logistic Regression with High-order Features for Automatic Grammar Rule Extraction from Treebanks
Santiago Herrera | Caio Corro | Sylvain Kahane
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Descriptive grammars are highly valuable, but writing them is time-consuming and difficult. Furthermore, while linguists typically use corpora to create them, grammar descriptions often lack quantitative data. As for formal grammars, they can be challenging to interpret. In this paper, we propose a new method to extract and explore significant fine-grained grammar patterns and potential syntactic grammar rules from treebanks, in order to create an easy-to-understand corpus-based grammar. More specifically, we extract descriptions and rules across different languages for two linguistic phenomena, agreement and word order, using a large search space and paying special attention to the ranking order of the extracted rules. For that, we use a linear classifier to extract the most salient features that predict the linguistic phenomena under study. We associate statistical information to each rule, and we compare the ranking of the model’s results to those of other quantitative and statistical measures. Our method captures both well-known and less well-known significant grammar rules in Spanish, French, and Wolof.

2023

pdf abs
Structural generalization in COGS: Supertagging is (almost) all you need
Alban Petit | Caio Corro | François Yvon
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

In many Natural Language Processing applications, neural networks have been found to fail to generalize on out-of-distribution examples. In particular, several recent semantic parsing datasets have put forward important limitations of neural networks in cases where compositional generalization is required. In this work, we extend a neural graph-based parsing framework in several ways to alleviate this issue, notably: (1) the introduction of a supertagging step with valency constraints, expressed as an integer linear program; (2) the reduction of the graph prediction problem to the maximum matching problem; (3) the design of an incremental early-stopping training strategy to prevent overfitting. Experimentally, our approach significantly improves results on examples that require structural generalization in the COGS dataset, a known challenging benchmark for compositional generalization. Overall, these results confirm that structural constraints are important for generalization in semantic parsing.

pdf abs
On Graph-based Reentrancy-free Semantic Parsing
Alban Petit | Caio Corro
Transactions of the Association for Computational Linguistics, Volume 11

We propose a novel graph-based approach for semantic parsing that resolves two problems observed in the literature: (1) seq2seq models fail on compositional generalization tasks; (2) previous work using phrase structure parsers cannot cover all the semantic parses observed in treebanks. We prove that both MAP inference and latent tag anchoring (required for weakly-supervised learning) are NP-hard problems. We propose two optimization algorithms based on constraint smoothing and conditional gradient to approximately solve these inference problems. Experimentally, our approach delivers state-of-the-art results on GeoQuery, Scan, and Clevr, both for i.i.d. splits and for splits that test for compositional generalization.

pdf abs
On the inconsistency of separable losses for structured prediction
Caio Corro
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

In this paper, we prove that separable negative log-likelihood losses for structured prediction are not necessarily Bayes consistent, that is minimizing these losses may not result in a model that predicts the most probable structure in the data distribution for a given input. This fact opens the question of whether these losses are well-adapted for structured prediction and, if so, why.

pdf abs
A dynamic programming algorithm for span-based nested named-entity recognition in O(n²)
Caio Corro
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Span-based nested named-entity recognition (NER) has a cubic-time complexity using avariant of the CYK algorithm. We show that by adding a supplementary structural constraint on the search space, nested NER has a quadratic-time complexity, that is the same asymptotic complexity than the non-nested case. The proposed algorithm covers a large part of three standard English benchmarks and delivers comparable experimental results.

2022

pdf abs
Ré-ordonnancement via programmation dynamique pour l’adaptation cross-lingue d’un analyseur en dépendances (Sentence reordering via dynamic programming for cross-lingual dependency parsing )
Nicolas Devatine | Caio Corro | François Yvon
Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

Cet article s’intéresse au transfert cross-lingue d’analyseurs en dépendances et étudie des méthodes pour limiter l’effet potentiellement néfaste pour le transfert de divergences entre l’ordre des mots dans les langues source et cible. Nous montrons comment apprendre et implémenter des stratégies de réordonnancement, qui, utilisées en prétraitement, permettent souvent d’améliorer les performances des analyseurs dans un scénario de transfert « zero-shot ».

pdf abs
Un algorithme d’analyse sémantique fondée sur les graphes via le problème de l’arborescence généralisée couvrante (A graph-based semantic parsing algorithm via the generalized spanning arborescence problem)
Alban Petit | Caio Corro
Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

Nous proposons un nouvel algorithme pour l’analyse sémantique fondée sur les graphes via le problème de l’arborescence généralisée couvrante.

2021

pdf bib abs
Auto-encodeurs variationnels : contrecarrer le problème de posterior collapse grâce à la régularisation du décodeur (Variational auto-encoders : prevent posterior collapse via decoder regularization)
Alban Petit | Caio Corro
Actes de la 28e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

Les auto-encodeurs variationnels sont des modèles génératifs utiles pour apprendre des représentations latentes. En pratique, lorsqu’ils sont supervisés pour des tâches de génération de textes, ils ont tendance à ignorer les variables latentes lors du décodage. Nous proposons une nouvelle méthode de régularisation fondée sur le dropout « fraternel » pour encourager l’utilisation de ces variables latentes. Nous évaluons notre approche sur plusieurs jeux de données et observons des améliorations dans toutes les configurations testées.

2020

pdf abs
Sur l’impact des contraintes structurelles pour l’analyse en dépendances profondes fondée sur les graphes (On the impact of structural constraints for graph-based deep dependency parsing)
Caio Corro
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 2 : Traitement Automatique des Langues Naturelles

Les algorithmes existants pour l’analyse en dépendances profondes fondée sur les graphes capables de garantir la connexité des structures produites ne couvrent pas les corpus du français. Nous proposons un nouvel algorithme qui couvre l’ensemble des structures possibles. Nous nous évaluons sur les corpus français FTB et Sequoia et observons un compromis entre la production de structures valides et la qualité des analyses.

pdf abs
Span-based discontinuous constituency parsing: a family of exact chart-based algorithms with time complexities from O(nˆ6) down to O(nˆ3)
Caio Corro
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We introduce a novel chart-based algorithm for span-based parsing of discontinuous constituency trees of block degree two, including ill-nested structures. In particular, we show that we can build variants of our parser with smaller search spaces and time complexities ranging from O(nˆ6) down to O(nˆ3). The cubic time variant covers 98% of constituents observed in linguistic treebanks while having the same complexity as continuous constituency parsers. We evaluate our approach on German and English treebanks (Negra, Tiger, and DPTB) and report state-of-the-art results in the fully supervised setting. We also experiment with pre-trained word embeddings and Bert-based neural networks.

2019

pdf abs
Learning Latent Trees with Stochastic Perturbations and Differentiable Dynamic Programming
Caio Corro | Ivan Titov
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We treat projective dependency trees as latent variables in our probabilistic model and induce them in such a way as to be beneficial for a downstream task, without relying on any direct tree supervision. Our approach relies on Gumbel perturbations and differentiable dynamic programming. Unlike previous approaches to latent tree learning, we stochastically sample global structures and our parser is fully differentiable. We illustrate its effectiveness on sentiment analysis and natural language inference tasks. We also study its properties on a synthetic structure induction task. Ablation studies emphasize the importance of both stochasticity and constraining latent structures to be projective trees.

2017

pdf abs
Efficient Discontinuous Phrase-Structure Parsing via the Generalized Maximum Spanning Arborescence
Caio Corro | Joseph Le Roux | Mathieu Lacroix
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

We present a new method for the joint task of tagging and non-projective dependency parsing. We demonstrate its usefulness with an application to discontinuous phrase-structure parsing where decoding lexicalized spines and syntactic derivations is performed jointly. The main contributions of this paper are (1) a reduction from joint tagging and non-projective dependency parsing to the Generalized Maximum Spanning Arborescence problem, and (2) a novel decoding algorithm for this problem through Lagrangian relaxation. We evaluate this model and obtain state-of-the-art results despite strong independence assumptions.

pdf
Transforming Dependency Structures to LTAG Derivation Trees
Caio Corro | Joseph Le Roux
Proceedings of the 13th International Workshop on Tree Adjoining Grammars and Related Formalisms