2023
pdf
abs
Annotators-in-the-loop: Testing a Novel Annotation Procedure on Italian Case Law
Emma Zanoli
|
Matilde Barbini
|
Davide Riva
|
Sergio Picascia
|
Emanuela Furiosi
|
Stefano D’Ancona
|
Cristiano Chesi
Proceedings of the 17th Linguistic Annotation Workshop (LAW-XVII)
The availability of annotated legal corpora is crucial for a number of tasks, such as legal search, legal information retrieval, and predictive justice. Annotation is mostly assumed to be a straightforward task: as long as the annotation scheme is well defined and the guidelines are clear, annotators are expected to agree on the labels. This is not always the case, especially in legal annotation, which can be extremely difficult even for expert annotators. We propose a legal annotation procedure that takes into account annotator certainty and improves it through negotiation. We also collect annotator feedback and show that our approach contributes to a positive annotation environment. Our work invites reflection on often neglected ethical concerns regarding legal annotation.
2022
pdf
abs
SemEval-2022 Task 3: PreTENS-Evaluating Neural Networks on Presuppositional Semantic Knowledge
Roberto Zamparelli
|
Shammur Chowdhury
|
Dominique Brunato
|
Cristiano Chesi
|
Felice Dell’Orletta
|
Md. Arid Hasan
|
Giulia Venturi
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)
We report the results of the SemEval 2022 Task 3, PreTENS, on evaluation the acceptability of simple sentences containing constructions whose two arguments are presupposed to be or not to be in an ordered taxonomic relation. The task featured two sub-tasks articulated as: (i) binary prediction task and (ii) regression task, predicting the acceptability in a continuous scale. The sentences were artificially generated in three languages (English, Italian and French). 21 systems, with 8 system papers were submitted for the task, all based on various types of fine-tuned transformer systems, often with ensemble methods and various data augmentation techniques. The best systems reached an F1-macro score of 94.49 (sub-task1) and a Spearman correlation coefficient of 0.80 (sub-task2), with interesting variations in specific constructions and/or languages.
2014
pdf
abs
Casa de la Lhéngua: a set of language resources and natural language processing tools for Mirandese
José Pedro Ferreira
|
Cristiano Chesi
|
Daan Baldewijns
|
Fernando Miguel Pinto
|
Margarita Correia
|
Daniela Braga
|
Hyongsil Cho
|
Amadeu Ferreira
|
Miguel Dias
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
This paper describes the efforts for the construction of Language Resources and NLP tools for Mirandese, a minority language spoken in North-eastern Portugal, now available on a community-led portal, Casa de la Lhéngua. The resources were developed in the context of a collaborative citizenship project led by Microsoft, in the context of the creation of the first TTS system for Mirandese. Development efforts encompassed the compilation of a corpus with over 1M tokens, the construction of a GTP system, syllable-division, inflection and a Part-of-Speech (POS) tagger modules, leading to the creation of an inflected lexicon of about 200.000 entries with phonetic transcription, detailed POS tagging, syllable division, and stress mark-up. Alongside these tasks, which were made easier through the adaptation and reuse of existing tools for closely related languages, a casting for voice talents among the speaking community was conducted and the first speech database for speech synthesis was recorded for Mirandese. These resources were combined to fulfil the requirements of a well-tested statistical parameter synthesis model, leading to an intelligible voice font. These language resources are available freely at Casa de la Lhéngua, aiming at promoting the development of real-life applications and fostering linguistic research on Mirandese.