2021
What Makes My Model Perplexed? A Linguistic Investigation on Neural Language Models Perplexity
Alessio Miaschi | Dominique Brunato | Felice Dell’Orletta | Giulia Venturi
Proceedings of Deep Learning Inside Out (DeeLIO): The 2nd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures
This paper presents an investigation aimed at studying how the linguistic structure of a sentence affects the perplexity of two of the most popular Neural Language Models (NLMs), BERT and GPT-2. We first compare the sentence-level likelihood computed with BERT and GPT-2’s perplexity, showing that the two metrics are correlated. In addition, we exploit linguistic features capturing a wide set of morpho-syntactic and syntactic phenomena, showing how they contribute to predicting the perplexity of the two NLMs.
2020
Linguistic Profiling of a Neural Language Model
Alessio Miaschi | Dominique Brunato | Felice Dell’Orletta | Giulia Venturi
Proceedings of the 28th International Conference on Computational Linguistics
In this paper we investigate the linguistic knowledge learned by a Neural Language Model (NLM) before and after a fine-tuning process and how this knowledge affects its predictions during several classification problems. We use a wide set of probing tasks, each of which corresponds to a distinct sentence-level feature extracted from different levels of linguistic annotation. We show that BERT is able to encode a wide range of linguistic characteristics, but it tends to lose this information when trained on specific downstream tasks. We also find that BERT’s capacity to encode different kinds of linguistic properties has a positive influence on its predictions: the more readable linguistic information about a sentence it stores, the higher its capacity to predict the expected label assigned to that sentence.
Tracking the Evolution of Written Language Competence in L2 Spanish Learners
Alessio Miaschi | Sam Davidson | Dominique Brunato | Felice Dell’Orletta | Kenji Sagae | Claudia Helena Sanchez-Gutierrez | Giulia Venturi
Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications
In this paper we present an NLP-based approach for tracking the evolution of written language competence in L2 Spanish learners using a wide range of linguistic features automatically extracted from students’ written productions. Beyond reporting classification results for different scenarios, we explore the connection between the most predictive features and the teaching curriculum, finding that our set of linguistic features often reflects the explicit instructions that students receive during each course.
“Voices of the Great War”: A Richly Annotated Corpus of Italian Texts on the First World War
Federico Boschetti | Irene De Felice | Stefano Dei Rossi | Felice Dell’Orletta | Michele Di Giorgio | Martina Miliani | Lucia C. Passaro | Angelica Puddu | Giulia Venturi | Nicola Labanca | Alessandro Lenci | Simonetta Montemagni
Proceedings of the 12th Language Resources and Evaluation Conference
“Voices of the Great War” is the first large corpus of Italian historical texts dating back to the period of the First World War. This corpus differs from other existing resources in several respects. First, from the linguistic point of view, it gives an account of the wide range of varieties in which Italian was articulated in that period, namely from diastratic (educated vs. uneducated writers), diaphasic (low/informal vs. high/formal registers) and diatopic (regional varieties, dialects) points of view. From the historical perspective, through a collection of texts belonging to different genres, it represents different views on the war and the various styles of narrating war events and experiences. The final corpus is balanced along various dimensions, corresponding to the textual genre, the language variety used, the author type and the typology of conveyed contents. The corpus is fully annotated with lemmas, part-of-speech, terminology, and named entities. Significant corpus samples representative of the different “voices” have also been enriched with meta-linguistic and syntactic information. The layer of syntactic annotation forms the first nucleus of an Italian historical treebank complying with the Universal Dependencies standard. The paper illustrates the final resource, the methodology and tools used to build it, and the Web Interface for navigating it.
Profiling-UD: a Tool for Linguistic Profiling of Texts
Dominique Brunato | Andrea Cimino | Felice Dell’Orletta | Giulia Venturi | Simonetta Montemagni
Proceedings of the 12th Language Resources and Evaluation Conference
In this paper, we introduce Profiling-UD, a new text analysis tool inspired by the principles of linguistic profiling that can support language variation research from different perspectives. It allows the extraction of more than 130 features, spanning different levels of linguistic description. Beyond the large number of features that can be monitored, a main novelty of Profiling-UD is that it has been specifically devised to be multilingual, since it is based on the Universal Dependencies framework. In the second part of the paper, we demonstrate the effectiveness of these features in a number of theoretical and applicative studies in which they were successfully used for text and author profiling.
2018
Is this Sentence Difficult? Do you Agree?
Dominique Brunato | Lorenzo De Mattei | Felice Dell’Orletta | Benedetta Iavarone | Giulia Venturi
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
In this paper, we present a crowdsourcing-based approach to model the human perception of sentence complexity. We collect a large corpus of sentences rated with judgments of complexity for two typologically different languages, Italian and English. We test our approach in two experimental scenarios aimed at investigating the contribution of a wide set of lexical, morpho-syntactic and syntactic phenomena in predicting i) the degree of agreement among annotators independently of the assigned judgment and ii) the perception of sentence complexity.
Assessing the Impact of Incremental Error Detection and Correction. A Case Study on the Italian Universal Dependency Treebank
Chiara Alzetta | Felice Dell’Orletta | Simonetta Montemagni | Maria Simi | Giulia Venturi
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)
Detection and correction of errors and inconsistencies in “gold treebanks” are becoming more and more central topics of corpus annotation. The paper illustrates a new incremental method for enhancing treebanks, with particular emphasis on the extension of error patterns across different textual genres and registers. Impact and role of corrections have been assessed in a dependency parsing experiment carried out with four different parsers, whose results are promising. For both evaluation datasets, the performance of parsers increases, in terms of the standard LAS and UAS measures and of a more focused measure taking into account only relations involved in error patterns, and at the level of individual dependencies.
Universal Dependencies and Quantitative Typological Trends. A Case Study on Word Order
Chiara Alzetta | Felice Dell’Orletta | Simonetta Montemagni | Giulia Venturi
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2017
Dangerous Relations in Dependency Treebanks
Chiara Alzetta | Felice Dell’Orletta | Simonetta Montemagni | Giulia Venturi
Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories
2016
Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC)
Dominique Brunato | Felice Dell’Orletta | Giulia Venturi | Thomas François | Philippe Blache
Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC)
PaCCSS-IT: A Parallel Corpus of Complex-Simple Sentences for Automatic Text Simplification
Dominique Brunato | Andrea Cimino | Felice Dell’Orletta | Giulia Venturi
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing
CItA: an L1 Italian Learners Corpus to Study the Development of Writing Competence
Alessia Barbagli | Pietro Lucisano | Felice Dell’Orletta | Simonetta Montemagni | Giulia Venturi
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
In this paper, we present the CItA corpus (Corpus Italiano di Apprendenti L1), a collection of essays written by Italian L1 learners collected during the first and second year of lower secondary school. The corpus was built in the framework of an interdisciplinary study jointly carried out by computational linguists and experimental pedagogists and aimed at tracking the development of written language competence over the years and students’ background information.
2015
Design and Annotation of the First Italian Corpus for Text Simplification
Dominique Brunato | Felice Dell’Orletta | Giulia Venturi | Simonetta Montemagni
Proceedings of The 9th Linguistic Annotation Workshop
NLP–Based Readability Assessment of Health–Related Texts: a Case Study on Italian Informed Consent Forms
Giulia Venturi | Tommaso Bellandi | Felice Dell’Orletta | Simonetta Montemagni
Proceedings of the Sixth International Workshop on Health Text Mining and Information Analysis
2014
T2K^2: a System for Automatically Extracting and Organizing Knowledge from Texts
Felice Dell’Orletta | Giulia Venturi | Andrea Cimino | Simonetta Montemagni
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
In this paper, we present T2K^2, a suite of tools for automatically extracting domain-specific knowledge from collections of Italian and English texts. T2K^2 (Text-To-Knowledge v2) relies on a battery of tools for Natural Language Processing (NLP), statistical text analysis and machine learning which are dynamically integrated to provide an accurate and incremental representation of the content of vast repositories of unstructured documents. Extracted knowledge ranges from domain-specific entities and named entities to the relations connecting them and can be used for indexing document collections with respect to different information types. T2K^2 also includes linguistic profiling functionalities aimed at supporting the user in constructing the acquisition corpus, e.g. in selecting texts belonging to the same genre or characterized by the same degree of specialization, or in monitoring the added value of newly inserted documents. T2K^2 is a web application, accessible from any browser through a personal account, which has been tested in a wide range of domains.
Assessing the Readability of Sentences: Which Corpora and Features?
Felice Dell’Orletta | Martijn Wieling | Giulia Venturi | Andrea Cimino | Simonetta Montemagni
Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications
2013
Linguistic Profiling of Texts Across Textual Genres and Readability Levels. An Exploratory Study on Italian Fictional Prose
Felice Dell’Orletta | Simonetta Montemagni | Giulia Venturi
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013
Linguistic Profiling based on General–purpose Features and Native Language Identification
Andrea Cimino | Felice Dell’Orletta | Giulia Venturi | Simonetta Montemagni
Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications
Unsupervised Linguistically-Driven Reliable Dependency Parses Detection and Self-Training for Adaptation to the Biomedical Domain
Felice Dell’Orletta | Giulia Venturi | Simonetta Montemagni
Proceedings of the 2013 Workshop on Biomedical Natural Language Processing
2012
Genre-oriented Readability Assessment: a Case Study
Felice Dell’Orletta | Giulia Venturi | Simonetta Montemagni
Proceedings of the Workshop on Speech and Language Processing Tools in Education
Enriching the ISST-TANL Corpus with Semantic Frames
Alessandro Lenci | Simonetta Montemagni | Giulia Venturi | Maria Grazia Cutrullà
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
The paper describes the design and the results of a manual annotation methodology devoted to enriching the ISST-TANL Corpus, derived from the Italian Syntactic-Semantic Treebank (ISST), with Semantic Frames information. The main issues encountered in applying the English FrameNet annotation criteria to a corpus of Italian are discussed, together with the choice of anchoring the semantic annotation layer to the underlying dependency syntactic structure. The results of a case study aimed at extending and specialising this methodology for the annotation of a corpus of legislative texts are also discussed.
2011
ULISSE: an Unsupervised Algorithm for Detecting Reliable Dependency Parses
Felice Dell’Orletta | Giulia Venturi | Simonetta Montemagni
Proceedings of the Fifteenth Conference on Computational Natural Language Learning
READ–IT: Assessing Readability of Italian Texts with a View to Text Simplification
Felice Dell’Orletta | Simonetta Montemagni | Giulia Venturi
Proceedings of the Second Workshop on Speech and Language Processing for Assistive Technologies
2010
A Contrastive Approach to Multi-word Extraction from Domain-specific Corpora
Francesca Bonin | Felice Dell’Orletta | Simonetta Montemagni | Giulia Venturi
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
In this paper, we present a novel approach to multi-word terminology extraction combining a well-known automatic term recognition approach, the C-NC value method, with a contrastive ranking technique, aimed at refining obtained results either by filtering noise due to common words or by discerning between semantically different types of terms within heterogeneous terminologies. Unlike other contrastive methods proposed in the literature, which focus on single terms to overcome the multi-word terms' sparsity problem, the proposed contrastive function is able to handle variation in low frequency events by directly operating on pre-selected multi-word terms. This methodology has been tested in two case studies carried out in the History of Art and Legal domains. Evaluation of achieved results showed that the proposed two-stage approach significantly improves multi-word term extraction results. In particular, as far as the legal domain is concerned, it provides an answer to a well-known problem in the semi-automatic construction of legal ontologies, namely that of singling out law terms from terms of the specific domain being regulated.
Contrastive Filtering of Domain-Specific Multi-Word Terms from Different Types of Corpora
Francesca Bonin | Felice Dell’Orletta | Giulia Venturi | Simonetta Montemagni
Proceedings of the 2010 Workshop on Multiword Expressions: from Theory to Applications
2008
Building a Bio-Event Annotated Corpus for the Acquisition of Semantic Frames from Biomedical Corpora
Paul Thompson | Philip Cotter | John McNaught | Sophia Ananiadou | Simonetta Montemagni | Andrea Trabucco | Giulia Venturi
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
This paper reports on the design and construction of a bio-event annotated corpus which was developed with a specific view to the acquisition of semantic frames from biomedical corpora. We describe the adopted annotation scheme and the annotation process, which is supported by a dedicated annotation tool. The annotated corpus contains 677 abstracts of biomedical research articles.