Ophélie Lacroix

2022

DDisCo: A Discourse Coherence Dataset for Danish
Linea Flansmose Mikkelsen | Oliver Kinch | Anders Jess Pedersen | Ophélie Lacroix
Proceedings of the Thirteenth Language Resources and Evaluation Conference

To date, there has been no resource for studying discourse coherence on real-world Danish texts. Discourse coherence has mostly been approached with the assumption that incoherent texts can be represented by coherent texts in which sentences have been shuffled. However, incoherent real-world texts rarely resemble that. We thus present DDisCo, a dataset including text from the Danish Wikipedia and Reddit annotated for discourse coherence. We choose to annotate real-world texts instead of relying on artificially incoherent text for training and testing models. Then, we evaluate the performance of several methods, including neural networks, on the dataset.

2021

Automatic coreference resolution is understudied in Danish even though most of the Danish Dependency Treebank (Buch-Kromann, 2003) is annotated with coreference relations. This paper describes a conversion of its partial, yet well-documented, coreference relations into coreference clusters and the training and evaluation of coreference models on this data. To the best of our knowledge, these are the first publicly available, neural coreference models for Danish. We also present a new entity linking annotation on the dataset using WikiData identifiers, a named entity disambiguation (NED) dataset, and a larger automatically created NED dataset enabling wikily supervised NED models. The entity linking annotation is benchmarked using a state-of-the-art neural entity disambiguation model.

DaNLP: An open-source toolkit for Danish Natural Language Processing
Amalie Brogaard Pauli | Maria Barrett | Ophélie Lacroix | Rasmus Hvingelby
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

We present an open-source toolkit for Danish Natural Language Processing that gives easy access to the latest advances in Danish NLP. The toolkit features wrapper functions for loading models and datasets in a unified way using third-party NLP frameworks. It is developed to foster community building, a better understanding of industry needs, and knowledge sharing. As an example, we present Angry Tweets, an annotation game designed to raise awareness of Danish NLP and to create a new sentiment-annotated dataset.

2020

Grammatical Error Correction in Low Error Density Domains: A New Benchmark and Analyses
Simon Flachs | Ophélie Lacroix | Helen Yannakoudakis | Marek Rei | Anders Søgaard
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Evaluation of grammatical error correction (GEC) systems has primarily focused on essays written by non-native learners of English, which, however, is only part of the full spectrum of GEC applications. We aim to broaden the target domain of GEC and release CWEB, a new benchmark for GEC consisting of website text generated by English speakers of varying levels of proficiency. Website data is a common and important domain that contains far fewer grammatical errors than learner essays, which we show presents a challenge to state-of-the-art GEC systems. We demonstrate that a factor behind this is the inability of systems to rely on a strong internal language model in low error density domains. We hope this work will facilitate the development of open-domain GEC models that generalize to different topics and genres.

2019

Noisy Channel for Low Resource Grammatical Error Correction
Simon Flachs | Ophélie Lacroix | Anders Søgaard
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications

This paper describes our contribution to the low-resource track of the BEA 2019 shared task on Grammatical Error Correction (GEC). Our approach to GEC builds on the theory of the noisy channel by combining a channel model and a language model. We generate confusion sets from the Wikipedia edit history and use the frequencies of edits to estimate the channel model. Additionally, we use two pre-trained language models: 1) Google’s BERT model, which we fine-tune for specific error types, and 2) OpenAI’s GPT-2 model, exploiting its ability to use previous sentences as context. Furthermore, we search for the optimal combination of corrections using beam search.
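As a rough illustration of the noisy-channel idea in this abstract, the sketch below picks the candidate correction that maximizes the sum of a channel log-probability and a language-model log-probability. It is a minimal sketch under stated assumptions, not the authors' system: the edit counts, the toy bigram table standing in for BERT or GPT-2, and the function names `channel_logprob`, `lm_logprob`, and `correct` are all hypothetical, and beam search over multiple corrections is omitted.

```python
import math
from collections import Counter

# Hypothetical edit counts, keyed by intended word and then by observed word.
# They stand in for frequencies mined from a revision history; all numbers are invented.
EDIT_COUNTS = {
    "their": Counter({"their": 90, "there": 10}),
    "there": Counter({"there": 95, "their": 5}),
}

# Toy stand-in for a language model: log P(word | previous word).
# The paper uses pre-trained neural models; this table is an assumption.
TOY_LM = {
    ("over", "there"): math.log(0.6),
    ("over", "their"): math.log(0.01),
}


def channel_logprob(observed, intended):
    """log P(observed | intended), estimated from relative edit frequencies."""
    counts = EDIT_COUNTS[intended]
    return math.log(counts[observed] / sum(counts.values()))


def lm_logprob(prev, word):
    """log P(word | prev) from the toy table, with a small floor for unseen pairs."""
    return TOY_LM.get((prev, word), math.log(1e-6))


def correct(prev, observed):
    """Return the candidate maximizing channel score plus language-model score."""
    # The confusion set is every intended word that has been seen producing `observed`.
    candidates = [c for c, counts in EDIT_COUNTS.items() if counts[observed] > 0]
    return max(
        candidates,
        key=lambda c: channel_logprob(observed, c) + lm_logprob(prev, c),
        default=observed,  # leave words without a confusion set unchanged
    )
```

Calling `correct("over", "their")` weighs how often "there" is mistyped as "their" against how likely each word is after "over"; a word with no confusion set is returned unchanged.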

Dependency Parsing as Sequence Labeling with Head-Based Encoding and Multi-Task Learning
Ophélie Lacroix
Proceedings of the Fifth International Conference on Dependency Linguistics (Depling, SyntaxFest 2019)

A Simple and Robust Approach to Detecting Subject-Verb Agreement Errors
Simon Flachs | Ophélie Lacroix | Marek Rei | Helen Yannakoudakis | Anders Søgaard
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

While rule-based detection of subject-verb agreement (SVA) errors is sensitive to syntactic parsing errors and irregularities and exceptions to the main rules, neural sequential labelers have a tendency to overfit their training data. We observe that rule-based error generation is less sensitive to syntactic parsing errors and irregularities than error detection and explore a simple, yet efficient approach to getting the best of both worlds: We train neural sequential labelers on the combination of large volumes of silver standard data, obtained through rule-based error generation, and gold standard data. We show that our simple protocol leads to more robust detection of SVA errors on both in-domain and out-of-domain data, as well as in the context of other errors and long-distance dependencies; and across four standard benchmarks, the induced model on average achieves a new state of the art.
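The abstract does not spell out the generation rules, so the following is only a minimal sketch of what rule-based error generation for silver-standard data could look like, assuming a tiny hand-written table of verb forms. The `FLIP` table and the `corrupt` function are hypothetical; a realistic generator would rely on part-of-speech or parse information to choose which verb to alter.

```python
# Hypothetical table mapping a verb form to its opposite-number counterpart.
FLIP = {"is": "are", "are": "is", "was": "were", "were": "was"}


def corrupt(tokens):
    """Inject one agreement error into a correct sentence.

    Returns (corrupted_tokens, labels), where labels mark the altered
    token with 1 and every other token with 0, or None if no verb from
    the table occurs in the sentence.
    """
    for i, tok in enumerate(tokens):
        if tok.lower() in FLIP:
            corrupted = tokens[:i] + [FLIP[tok.lower()]] + tokens[i + 1:]
            labels = [0] * len(tokens)
            labels[i] = 1
            return corrupted, labels
    return None
```

For example, `corrupt(["The", "dogs", "are", "barking"])` yields the tokens `["The", "dogs", "is", "barking"]` with labels `[0, 0, 1, 0]`, which can serve as one silver-standard training instance for a sequence labeler.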

2018

Investigating NP-Chunking with Universal Dependencies for English
Ophélie Lacroix
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)

Chunking is a pre-processing task generally dedicated to improving constituency parsing. In this paper, we want to show that universal dependency (UD) parsing can also leverage the information provided by the task of chunking even though annotated chunks are not provided with universal dependency trees. In particular, we introduce the possibility of deducing noun-phrase (NP) chunks from universal dependencies, focusing on English as a first example. We then demonstrate how the task of NP-chunking can benefit PoS-tagging in a multi-task learning setting – comparing two different strategies – and how it can be used as a feature for dependency parsing in order to learn enriched models.

Automatically Selecting the Best Dependency Annotation Design with Dynamic Oracles
Guillaume Wisniewski | Ophélie Lacroix | François Yvon
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

This work introduces a new strategy to compare the numerous conventions that have been proposed over the years for expressing dependency structures and to discover the one for which a parser will achieve the highest parsing performance. Instead of associating each sentence in the training set with a single gold reference, we propose to consider a set of references encoding alternative syntactic representations. Training a parser with a dynamic oracle will then automatically select, among all alternatives, the reference that will be predicted with the highest accuracy. Experiments on the UD corpora show the validity of this approach.

2017

A Systematic Comparison of Syntactic Representations of Dependency Parsing
Guillaume Wisniewski | Ophélie Lacroix
Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017)

Cross-lingual and cross-domain discourse segmentation of entire documents
Chloé Braud | Ophélie Lacroix | Anders Søgaard
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Discourse segmentation is a crucial step in building end-to-end discourse parsers. However, discourse segmenters only exist for a few languages and domains. Typically they only detect intra-sentential segment boundaries, assuming gold standard sentence and token segmentation, and relying on high-quality syntactic parses and rich heuristics that are not generally available across languages and domains. In this paper, we propose statistical discourse segmenters for five languages and three domains that do not rely on gold pre-annotations. We also consider the problem of learning discourse segmenters when no labeled data is available for a language. Our fully supervised system obtains 89.5% F1 for English newswire, with slight drops in performance on other domains, and we report supervised and unsupervised (cross-lingual) results for five languages in total.

Does syntax help discourse segmentation? Not so much
Chloé Braud | Ophélie Lacroix | Anders Søgaard
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Discourse segmentation is the first step in building discourse parsers. Most work on discourse segmentation does not scale to real-world discourse parsing across languages, for two reasons: (i) models rely on constituent trees, and (ii) experiments have relied on gold standard identification of sentence and token boundaries. We therefore investigate to what extent constituents can be replaced with universal dependencies, or left out completely, as well as how state-of-the-art segmenters fare in the absence of sentence boundaries. Our results show that dependency information is less useful than expected, but we provide a fully scalable, robust model that only relies on part-of-speech information, and show that it performs well across languages in the absence of any gold-standard annotation.

2016

Apprentissage d’analyseur en dépendances cross-lingue par projection partielle de dépendances (Cross-lingual learning of dependency parsers from partially projected dependencies)
Ophélie Lacroix | Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Articles longs)

This article presents a simple method for cross-lingual dependency transfer. We first show that a transition-based dependency parser can be learned from partially annotated data. We then propose to build large partially annotated datasets for several target languages by projecting dependencies through the most reliable alignment links. By training parsers for the target languages on these partial data, we show that this simple method achieves performance that rivals recent state-of-the-art methods, at a lower computational cost.
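The partial projection step described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data layout, the confidence threshold of 0.9, and the function name `project` are all hypothetical. A source dependency is copied to the target sentence only when both its dependent and its head are aligned with sufficient confidence; every other target token keeps an unknown head, which gives a partially annotated tree.

```python
def project(source_heads, alignment, n_target, threshold=0.9):
    """Project source dependencies onto a target sentence through alignments.

    source_heads: dict mapping source token index to its head index
                  (tokens are numbered from 1, and 0 denotes the root).
    alignment:    dict mapping source token index to a pair
                  (target token index, alignment confidence).
    n_target:     number of tokens in the target sentence.
    Returns a list of target heads, with None where no head was projected.
    """
    target_heads = [None] * (n_target + 1)  # position 0 is reserved for the root
    for dep, head in source_heads.items():
        if dep not in alignment:
            continue
        t_dep, c_dep = alignment[dep]
        if c_dep < threshold:
            continue  # dependent is not reliably aligned
        if head == 0:
            target_heads[t_dep] = 0  # root attachment needs no aligned head
            continue
        if head not in alignment:
            continue
        t_head, c_head = alignment[head]
        if c_head < threshold:
            continue  # head is not reliably aligned
        target_heads[t_dep] = t_head
    return target_heads[1:]
```

With source heads `{1: 2, 2: 3, 3: 0}` and alignments `{1: (1, 0.95), 2: (2, 0.99), 3: (3, 0.5)}`, only the first dependency is projected, because the third token's alignment falls below the threshold.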

Frustratingly Easy Cross-Lingual Transfer for Transition-Based Dependency Parsing
Ophélie Lacroix | Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Cross-lingual Dependency Transfer: What Matters? Assessing the Impact of Pre- and Post-processing
Ophélie Lacroix | Guillaume Wisniewski | François Yvon
Proceedings of the Workshop on Multilingual and Cross-lingual Methods in NLP

2015

CDGFr, un corpus en dépendances non-projectives pour le français (CDGFr, a non-projective dependency corpus for French)
Denis Béchet | Ophélie Lacroix
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

In dependency parsing of French, non-projectivity is rarely taken into account, largely because the data on which parsers are trained contain few or no such cases. In this article, we present a new, freely available dependency corpus for French that contains a substantial number of non-projective dependencies. This corpus will make it possible to study non-projectivity and to better account for it in the parsing of French.

2014

A Three-Step Transition-Based System for Non-Projective Dependency Parsing
Ophélie Lacroix | Denis Béchet
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

Validation Issues induced by an Automatic Pre-Annotation Mechanism in the Building of Non-projective Dependency Treebanks
Ophélie Lacroix | Denis Béchet
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

To build large dependency treebanks with the CDG Lab, a grammar-based dependency treebank development tool, an annotator usually has to fill in a selection form before parsing. This step is usually necessary because, otherwise, the search space is too large for long sentences and the parser fails to produce even one solution. With the information the annotator gives on the selection form, the parser can produce one or several dependency structures, and the annotator can then add positive or negative annotations on dependencies and relaunch the parser iteratively until the right dependency structure is found. However, the selection form is sometimes difficult and time-consuming to fill in because the annotator must have an idea of the result before parsing. The CDG Lab proposes to replace this form with an automatic pre-annotation mechanism. However, this mechanism introduces some issues during the annotation phase that do not exist when the annotator uses a selection form. The article presents those issues and proposes some modifications to the CDG Lab so that the automatic pre-annotation mechanism can be used effectively.