Lauriane Aufrant
This work proposes to revisit entity linking approaches in light of the closely related task of coreference resolution. We observe various configurations (supported by examples) in which the rest of the coreference chain can provide useful cues to improve disambiguation. Guided by these theoretical motivations, we conduct an error analysis together with oracle experiments, which confirm the potential of strategies that combine predictions within the coreference chain (up to 4.3 F1 on coreferent mentions in English). We then sketch a first proof of concept of vote-based combination, exploring various weighting heuristics, which brings modest but interpretable gains.
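To illustrate the kind of vote-based combination this abstract describes, here is a minimal sketch (not the paper's implementation): each coreferent mention casts a confidence-weighted vote for the entity it was linked to, optionally rescaled by a per-mention weighting heuristic, and the winning entity is reassigned to the whole chain. The entity IDs and the weighting scheme are illustrative assumptions.

```python
from collections import defaultdict

def vote_entity_for_chain(mention_predictions, weights=None):
    """Aggregate per-mention entity-linking predictions over one
    coreference chain by weighted voting.

    mention_predictions: list of (entity_id, confidence) pairs,
        one per coreferent mention.
    weights: optional per-mention weights (e.g. a heuristic giving
        more weight to proper-name mentions than to pronouns).
    """
    if weights is None:
        weights = [1.0] * len(mention_predictions)
    scores = defaultdict(float)
    for (entity, confidence), w in zip(mention_predictions, weights):
        scores[entity] += w * confidence
    # The majority entity can then be reassigned to every mention.
    return max(scores, key=scores.get)

# Illustrative chain: two mentions linked to one entity, one outlier;
# the vote repairs the outlier.
chain = [("Paris_France", 0.8), ("Paris_Texas", 0.6), ("Paris_France", 0.7)]
print(vote_entity_for_chain(chain))  # -> "Paris_France"
```

A plausible weighting heuristic, in the spirit of the abstract's exploration, would favor proper-name mentions over pronouns, whose linking predictions tend to be less reliable.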
Named entity recognition as it is traditionally envisioned excludes in practice a significant part of the entities of potential interest for real-world applications: nested, discontinuous, and non-named entities. Despite various attempts to broaden their coverage, subsequent annotation schemes have achieved little adoption in the literature, and the most restrictive variant of NER remains the default. This is partly due to the complexity of those annotations and their format. In this paper, we introduce a new annotation scheme that offers higher comprehensiveness while preserving simplicity, together with an annotation tool to implement that scheme. We also release the UkraiNER corpus, comprising 10,000 French sentences in the geopolitical news domain, manually annotated with comprehensive entity recognition. Our baseline experiments on UkraiNER provide a first point of comparison to facilitate future research (82 F1 for comprehensive entity recognition, 87 F1 when focusing on traditional nested NER), as well as various insights into the composition of this corpus and the challenges it presents for state-of-the-art named entity recognition models.
This contribution presents the work of the European AI standardization committee on NLP. The CEN-CENELEC JTC 21 committee has been mandated by the European Commission to develop the technical standards enabling the application of the upcoming European AI regulation: performance, robustness, transparency, etc. In this context, NLP has been identified as a specific facet of AI, deserving its own tools, criteria, and good practices. This observation led to the development of an ambitious roadmap that includes several NLP standardization projects. To date, a first effort to inventory and define NLP tasks has already been initiated, and the drafting of a standard on evaluation metrics is beginning. This work has also prompted a broader reflection on the standardization needs of NLP, including a taxonomy of methods and work on annotation formats and interoperability.
While standardization is a well-established activity in other scientific fields such as telecommunications, networks, or multimedia, in the field of AI, and more specifically NLP, it is still in its infancy. In this paper, we explore how various aspects of NLP (evaluation, data, tasks...) lack standards and how that can impact science, but also society, industry, and regulation. We argue that the numerous initiatives to rationalize the field and establish good practices are only a first step, and that developing formal standards remains necessary to bring further clarity to NLP research and industry, at a time when this community faces various crises regarding ethics and reproducibility. We thus encourage NLP researchers to contribute to existing and upcoming standardization projects, so that they can express their needs and concerns while sharing their expertise.
Not all dependencies are equal when training a dependency parser: some are straightforward enough to be learned with only a sample of data, others embed more complexity. This work introduces a series of metrics to quantify those differences, and thereby to expose the shortcomings of various parsing algorithms and strategies. Apart from a more thorough comparison of parsing systems, these new tools also prove useful for characterizing the information conveyed by cross-lingual parsers, in a quantitative but still interpretable way.
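The abstract does not define its metrics here, but the underlying idea can be illustrated with a simple per-dependency diagnostic: contrast dependency labels by how often a parser attaches them to the wrong head. The sketch below is an illustrative assumption, not the paper's actual metrics.

```python
from collections import Counter

def per_label_error_rates(gold, predicted):
    """Contrast easy and hard dependencies by measuring, for each
    dependency label, how often the predicted head is wrong.

    gold, predicted: lists of (head_index, label), one per token.
    Returns {label: error_rate}; higher means harder for this parser.
    """
    errors, totals = Counter(), Counter()
    for (g_head, label), (p_head, _) in zip(gold, predicted):
        totals[label] += 1
        if p_head != g_head:
            errors[label] += 1
    return {label: errors[label] / totals[label] for label in totals}

# Toy three-token comparison: only the "obj" attachment is wrong.
gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (1, "obj")]
print(per_label_error_rates(gold, pred))
# -> {'nsubj': 0.0, 'root': 0.0, 'obj': 1.0}
```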
Because the most common transition systems are projective, training a transition-based dependency parser often implies either ignoring or rewriting the non-projective training examples, which has an adverse impact on accuracy. In this work, we propose a simple modification of dynamic oracles, which enables the use of non-projective data when training projective parsers. Evaluation on 73 treebanks shows that our method achieves significant gains (+2 to +7 UAS for the most non-projective languages) and consistently outperforms traditional projectivization and pseudo-projectivization approaches.
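For context, a dependency tree is non-projective when two of its arcs cross; such trees cannot be derived by projective transition systems, which is why they are usually dropped or projectivized. The check below is the standard definition, not anything specific to this paper.

```python
def is_projective(heads):
    """Check projectivity of a dependency tree given as a list of
    1-based head indices (0 = root): token i has head heads[i-1].
    A tree is projective iff no two arcs cross.
    """
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for i, (l1, r1) in enumerate(arcs):
        for l2, r2 in arcs[i + 1:]:
            # Two arcs cross iff exactly one endpoint of one arc
            # lies strictly inside the span of the other.
            if l1 < l2 < r1 < r2 or l2 < l1 < r2 < r1:
                return False
    return True

print(is_projective([3, 4, 0, 3, 3]))  # arcs (1,3) and (2,4) cross -> False
```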
This paper formalizes a sound extension of dynamic oracles to global training, in the framework of transition-based dependency parsing. By dispensing with the pre-computation of references, this extension widens the training strategies that can be entertained for such parsers; we show this by revisiting two standard training procedures, early-update and max-violation, to correct some of their search space sampling biases. Experimentally, on the SPMRL treebanks, this improvement increases the similarity between the train and test distributions and yields performance improvements of up to 0.7 UAS, without any computational overhead.
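To make the early-update procedure mentioned above concrete, here is a self-contained toy version of Collins and Roark's scheme (not this paper's dynamic-oracle extension): beam search runs over action sequences scored by action-bigram features, and as soon as the gold prefix falls off the beam, the weights are updated and search stops. All data structures here are illustrative simplifications.

```python
def beam_step(beams, actions, weights, width):
    """Expand every (sequence, score) hypothesis with every action,
    score each step with a bigram feature (previous action, action),
    and keep the `width` best hypotheses."""
    expanded = []
    for seq, score in beams:
        prev = seq[-1] if seq else "<s>"
        for a in actions:
            expanded.append((seq + [a], score + weights.get((prev, a), 0.0)))
    expanded.sort(key=lambda h: -h[1])
    return expanded[:width]

def early_update(gold, actions, weights, width=2, lr=1.0):
    """One early-update training step: as soon as the gold action
    prefix falls off the beam, update towards the gold prefix and
    away from the current best hypothesis, then stop searching."""
    beams = [([], 0.0)]
    for t in range(len(gold)):
        beams = beam_step(beams, actions, weights, width)
        gold_prefix = gold[: t + 1]
        if all(seq != gold_prefix for seq, _ in beams):
            best = beams[0][0]
            for i, (g, b) in enumerate(zip(gold_prefix, best)):
                g_prev = gold_prefix[i - 1] if i else "<s>"
                b_prev = best[i - 1] if i else "<s>"
                weights[(g_prev, g)] = weights.get((g_prev, g), 0.0) + lr
                weights[(b_prev, b)] = weights.get((b_prev, b), 0.0) - lr
            return  # the rest of the gold derivation is unreachable

# Repeated passes teach the toy model to keep the gold sequence on the beam.
weights = {}
for _ in range(5):
    early_update(["SHIFT", "LEFT-ARC", "SHIFT", "RIGHT-ARC"],
                 ["SHIFT", "LEFT-ARC", "RIGHT-ARC"], weights)
```

Max-violation differs by searching to the end and updating at the step where the score margin between the best hypothesis and the gold prefix is largest.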
This paper describes LIMSI's submission to the CoNLL 2017 UD Shared Task, which focuses on small treebanks and on how to improve low-resource parsing solely through ad hoc combination of multiple views and resources. We present our approach for low-resource parsing, together with a detailed analysis of the results for each test treebank. We also report extensive analysis experiments on model selection for the PUD treebanks, and on annotation consistency among UD treebanks.
This paper presents a simple method for cross-lingual dependency transfer. We first show that it is possible to train a transition-based dependency parser from partially annotated data. We then propose to build large partially annotated datasets for several target languages by projecting dependencies through the most confident alignment links. By training parsers for the target languages on these partial data, we show that this simple method achieves performance that rivals recent state-of-the-art methods, at a lower algorithmic cost.
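The projection step can be pictured as follows: keep only arcs whose endpoints are both covered by a confident alignment link, and leave the remaining target tokens unannotated, yielding the partial trees the abstract mentions. This is a minimal sketch assuming one-to-one, high-confidence links (e.g. intersected bidirectional alignments), not the paper's exact procedure.

```python
def project_dependencies(src_heads, alignment):
    """Project source-side dependencies onto the target sentence
    through one-to-one word alignment links, keeping only arcs whose
    endpoints are both aligned (unaligned tokens stay unannotated).

    src_heads: dict {src_dependent: src_head} (1-based, 0 = root).
    alignment: dict {src_index: tgt_index}, assumed high-confidence.
    Returns a partial head assignment {tgt_dependent: tgt_head}.
    """
    tgt_heads = {}
    for dep, head in src_heads.items():
        if dep in alignment and (head == 0 or head in alignment):
            tgt_heads[alignment[dep]] = 0 if head == 0 else alignment[head]
    return tgt_heads

# Toy example: 3-token source tree, source token 3 left unaligned,
# so the target annotation stays partial.
print(project_dependencies({1: 2, 2: 0, 3: 2}, {1: 2, 2: 1}))
# -> {2: 1, 1: 0}
```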
In this paper, we propose three simple improvements for the global training of arc-eager transition-based dependency parsers: a non-deterministic oracle, resuming on the same example after an update, and training in suboptimal configurations. Their combination brings an average gain of 0.2 UAS on the SPMRL corpus. We also introduce a general framework enabling the systematic comparison of these strategies and of most known variants. We show that the literature has studied only a few strategies among the many possible variations, thus neglecting several potential avenues of improvement.
This paper studies cross-lingual transfer for dependency parsing, focusing on very low-resource settings where delexicalized transfer is the only fully automatic option. We show how to boost parsing performance by rewriting the source sentences so as to better match the linguistic regularities of the target language. We contrast a data-driven approach with an approach relying on linguistically motivated rules automatically extracted from the World Atlas of Language Structures. Our findings are backed up by experiments involving 40 languages. They show that both approaches greatly outperform the baseline, the knowledge-driven method yielding the best accuracies, with average improvements of +2.9 UAS, and up to +90 UAS (absolute) on some frequent PoS configurations.
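The knowledge-driven rewriting can be pictured with a single rule: if WALS says the target language places adjectives after nouns (feature 87A, Order of Adjective and Noun), swap adjacent ADJ+NOUN pairs in the (delexicalized) source treebank before training. The sketch below is an illustrative one-rule assumption; the paper's rule set and its handling of dependency structure are more involved.

```python
def reorder_adj_noun(tokens, pos_tags, target_adj_after_noun):
    """Rewrite a source sentence so adjective/noun order matches the
    target language. Only adjacent ADJ+NOUN pairs are swapped; a
    sketch, not the paper's full rewriting procedure.
    """
    out = list(zip(tokens, pos_tags))
    i = 0
    while i < len(out) - 1:
        (_, p1), (_, p2) = out[i], out[i + 1]
        if target_adj_after_noun and p1 == "ADJ" and p2 == "NOUN":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return out

# English "red car" rewritten for a Noun-Adjective target (e.g. French):
print(reorder_adj_noun(["red", "car"], ["ADJ", "NOUN"], True))
# -> [('car', 'NOUN'), ('red', 'ADJ')]
```

In a real treebank-rewriting setting, the head indices of the swapped tokens would of course also have to be remapped.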
Because of the small size of Romanian corpora, the performance of a PoS tagger or a dependency parser trained with standard supervised methods falls far short of the performance achieved in most languages. We therefore apply state-of-the-art methods for cross-lingual transfer to Romanian tagging and parsing, from English and several Romance languages. We compare the performance with monolingual systems trained on sets of different sizes, and establish that training on a few sentences in the target language yields better results than transferring from large datasets in other languages.