This is an internal, incomplete preview of a proposed change to the ACL Anthology.
For efficiency reasons, we don't generate MODS or Endnote formats, and the preview may be incomplete in other ways, or contain mistakes.
Do not treat this content as an official publication.
LouisEstève
Fixing paper assignments
Please select all papers that belong to the same person.
Indicate below which author they should be assigned to.
The annotation of large text corpora is essential for many tasks. We present here a large automatically annotated corpus for French. This corpus is separated into two parts: the first from BigScience, and the second from HPLT. The annotated documents from HPLT were selected in order to optimise the lexical diversity of the final corpus SELEXINI. An analysis of the impact of this selection was carried out on syntactic diversity, as well as on the quality of the new words resulting from the HPLT part of SELEXINI. We have shown that despite the introduction of interesting new words, the texts extracted from HPLT are very noisy. Furthermore, increasing lexical diversity did not increase syntactic diversity.
L’annotation de grands corpus de texte est essentielle pour de nombreuses tâches de Traitement Automatique des Langues. Dans cet article, nous présentons SELEXINI, un grand corpus français annoté automatiquement en syntaxe. Ce corpus est composé de deux parties : la partie BigScience, et la partie HPLT. Les documents de la partie HPLT ont été sélectionnés dans le but de maximiser la diversité lexicale du corpus total SELEXINI. Une analyse de l’impact de cette sélection sur la diversité syntaxique a été réalisée, ainsi qu’une étude de la qualité des nouveaux mots issus de la partie HPLT du corpus SELEXINI. Nous avons pu montrer que malgré l’introduction de nouveaux mots considérés comme intéressants (formes de conjugaison rares, néologismes, mots rares,...), les textes issus de HPLT sont extrêmement bruités. De plus, l’augmentation de la diversité lexicale n’a pas permis d’augmenter la diversité syntaxique.
Multiword Expressions (MWEs) make a goodcase study for linguistic diversity due to theiridiosyncratic nature. Defining MWE canonicalforms as types, diversity may be measurednotably through disparity, based on pairwisedistances between types. To this aim, wetrain static MWE-aware word embeddings forverbal MWEs in 14 languages, and we showinteresting properties of these vector spaces.We use these vector spaces to implement theso-called functional diversity measure. Weapply this measure to the results of severalMWE identification systems. We find that,although MWE vector spaces are meaningful ata local scale, the disparity measure aggregatingthem at a global scale strongly correlateswith the number of types, which questions itsusefulness in presence of simpler diversitymetrics such as variety. We make the vectorspaces we generated available.
This paper presents the work produced by students of the University of Orlans Masters in Natural Language Processing program by way of participating in SemEval Task 6, LegalEval, which aims to enhance the capabilities of legal professionals through automated systems. Two out of the three sub-tasks available – Rhetorical Role prediction (RR) and Legal Named Entity Recognition (L-NER) – were tackled, with the express intent of developing lightweight and interpretable systems. For the L-NER sub-task, a CRF model was trained, augmented with post-processing rules for some named entity types. A macro F1 score of 0.74 was obtained on the DEV set, and 0.64 on the evaluation set. As for the RR sub-task, two sentence classification systems were built: one based on the Bag-of-Words technique with L-NER system output integrated, the other using a sentence-transformer approach. Rule-based post-processing then converted the results of the sentence classification systems into RR predictions. The better-performing Bag-of-Words system obtained a macro F1 score of 0.49 on the DEV set and 0.57 on the evaluation set.