Marco Passarotti


2022

The Index Thomisticus Treebank as Linked Data in the LiLa Knowledge Base
Francesco Mambrini | Marco Passarotti | Giovanni Moretti | Matteo Pellegrini
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Although the Universal Dependencies initiative today allows for cross-linguistically consistent annotation of morphology and syntax in treebanks for several languages, syntactically annotated corpora are not yet interoperable with many lexical resources that describe properties of the words that occur therein. To cope with this limitation, we propose adopting the principles of the Linguistic Linked Open Data (LLOD) community to describe and publish dependency treebanks as LLOD. In particular, this paper illustrates the approach pursued in the LiLa Knowledge Base, which enables interoperability between corpora and lexical resources for Latin, to publish as Linguistic Linked Open Data the annotation layers of two versions of a Medieval Latin treebank (the Index Thomisticus Treebank).

Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages
Rachele Sprugnoli | Marco Passarotti
Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages

Overview of the EvaLatin 2022 Evaluation Campaign
Rachele Sprugnoli | Marco Passarotti | Flavio Massimiliano Cecchini | Margherita Fantoli | Giovanni Moretti
Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages

This paper describes the organization and the results of the second edition of EvaLatin, the campaign for the evaluation of Natural Language Processing tools for Latin. The three shared tasks proposed in EvaLatin 2022, i.e., Lemmatization, Part-of-Speech Tagging and Features Identification, are aimed at fostering research in the field of language technologies for Classical languages. The shared dataset consists of texts mainly taken from the LASLA corpus. More specifically, the training set includes only prose texts of the Classical period, whereas the test set is organized in three sub-tasks: a Classical sub-task on a prose text of an author not included in the training data, a Cross-genre sub-task on poetic and scientific texts, and a Cross-time sub-task on a text of the 15th century. The results obtained by the participants for each task and sub-task are presented and discussed.

Linking the LASLA Corpus in the LiLa Knowledge Base of Interoperable Linguistic Resources for Latin
Margherita Fantoli | Marco Passarotti | Francesco Mambrini | Giovanni Moretti | Paolo Ruffolo
Proceedings of the 8th Workshop on Linked Data in Linguistics within the 13th Language Resources and Evaluation Conference

This paper describes the process of interlinking the 130 Classical Latin texts provided by an annotated corpus developed at the LASLA laboratory with the LiLa Knowledge Base, which makes linguistic resources for Latin interoperable by following the principles of the Linked Data paradigm and making reference to classes and properties of widely adopted ontologies to model the relevant information. After introducing the overall architecture of the LiLa Knowledge Base and the LASLA corpus, the paper details the phases of the process of linking the corpus with the collection of lemmas of LiLa and presents a federated query to exemplify the added value of interoperability of LASLA’s texts with other resources for Latin.

Computational Morphology with OntoLex-Morph
Christian Chiarcos | Katerina Gkirtzou | Fahad Khan | Penny Labropoulou | Marco Passarotti | Matteo Pellegrini
Proceedings of the 8th Workshop on Linked Data in Linguistics within the 13th Language Resources and Evaluation Conference

This paper describes the current status of the emerging OntoLex module for linguistic morphology. It serves as an update to the previous version of the vocabulary (Klimek et al. 2019). Whereas this earlier model focused exclusively on descriptive morphology and on applications in lexicography, we now present a novel part of the vocabulary and a novel application of it in language technology, i.e., the rule-based generation of lexicons, introducing a dynamic component into OntoLex.

2020

Representing Etymology in the LiLa Knowledge Base of Linguistic Resources for Latin
Francesco Mambrini | Marco Passarotti
Proceedings of the 2020 Globalex Workshop on Linked Lexicography

In this paper we describe the process of including etymological information in a knowledge base of interoperable Latin linguistic resources developed in the context of the LiLa: Linking Latin project. Interoperability is obtained by applying the Linked Open Data principles. In particular, an extensive collection of Latin lemmas is used to link the (distributed) resources. For the etymology, we rely on the Ontolex-lemon ontology and the lemonEty extension to model the information, while the source data are taken from a recent etymological dictionary of Latin. As a result, the collection of lemmas around which LiLa is built now includes 1,465 Proto-Italic and 1,393 Proto-Indo-European reconstructed forms that are used to explain the history of 1,400 Latin words. We discuss the motivation, methodology and modeling strategies of the work, as well as its possible applications and potential future developments.

Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages
Rachele Sprugnoli | Marco Passarotti
Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages

Overview of the EvaLatin 2020 Evaluation Campaign
Rachele Sprugnoli | Marco Passarotti | Flavio Massimiliano Cecchini | Matteo Pellegrini
Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages

This paper describes the first edition of EvaLatin, a campaign entirely devoted to the evaluation of NLP tools for Latin. The two shared tasks proposed in EvaLatin 2020, i.e., Lemmatization and Part-of-Speech tagging, are aimed at fostering research in the field of language technologies for Classical languages. The shared dataset consists of texts taken from the Perseus Digital Library, processed with UDPipe models and then manually corrected by Latin experts. The training set includes only prose texts by Classical authors. The test set, alongside prose texts by the same authors represented in the training set, also includes data from poetry and from the Medieval period. This also allows us to propose the Cross-genre and Cross-time subtasks for each task, in order to evaluate the portability of NLP tools for Latin across different genres and time periods. The results obtained by the participants for each task and subtask are presented and discussed.

A New Latin Treebank for Universal Dependencies: Charters between Ancient Latin and Romance Languages
Flavio Massimiliano Cecchini | Timo Korkiakangas | Marco Passarotti
Proceedings of the Twelfth Language Resources and Evaluation Conference

The present work introduces a new Latin treebank that follows the Universal Dependencies (UD) annotation standard. The treebank is obtained from the automated conversion of the Late Latin Charter Treebank 2 (LLCT2), originally in the Prague Dependency Treebank (PDT) style. As this treebank consists of Early Medieval legal documents, its language variety differs considerably from both the Classical and Medieval learned varieties prevalent in the other currently available UD Latin treebanks. Consequently, besides significant phenomena from the perspective of diachronic linguistics, this treebank also poses several challenging technical issues for the current and future syntactic annotation of Latin in the UD framework. Some of the most relevant cases are discussed in depth, with comparisons between the original PDT and the resulting UD annotations. Additionally, an overview of the UD-style structure of the treebank is given, and some diachronic aspects of the transition from Latin to Romance languages are highlighted.

Odi et Amo. Creating, Evaluating and Extending Sentiment Lexicons for Latin.
Rachele Sprugnoli | Marco Passarotti | Daniela Corbetta | Andrea Peverelli
Proceedings of the Twelfth Language Resources and Evaluation Conference

Sentiment lexicons are essential for developing automatic sentiment analysis systems, but the resources currently available mostly cover modern languages. Lexicons for ancient languages are few and not evaluated with high-quality gold standards. However, the study of attitudes and emotions in ancient texts is a growing field of research which poses specific issues (e.g., lack of native speakers, limited amount of data, unusual textual genres for the sentiment analysis task, such as philosophical or documentary texts) and can have an impact on the work of scholars coming from several disciplines besides computational linguistics, e.g., historians and philologists. The work presented in this paper aims at providing the research community with a set of sentiment lexicons built by taking advantage of manually curated resources belonging to the long tradition of creating Latin corpora and lexicons. Our interdisciplinary approach led us to release: i) two automatically generated sentiment lexicons; ii) a gold standard developed by two Latin language and culture experts; iii) a silver standard in which semantic and derivational relations are exploited so as to extend the list of lexical items of the gold standard. In addition, the evaluation procedure is described together with a first application of the lexicons to a Latin tragedy.

2019

Harmonizing Different Lemmatization Strategies for Building a Knowledge Base of Linguistic Resources for Latin
Francesco Mambrini | Marco Passarotti
Proceedings of the 13th Linguistic Annotation Workshop

The interoperability between lemmatized corpora of Latin and other resources that use the lemma as indexing key is hampered by the multiple lemmatization strategies that different projects adopt. In this paper we discuss how we tackle the challenges raised by harmonizing different lemmatization criteria in the context of a project that aims to connect linguistic resources for Latin using the Linked Data paradigm. The paper introduces the architecture supporting an open-ended, lemma-based Knowledge Base, built to make textual and lexical resources for Latin interoperable. Particularly, the paper describes the inclusion into the Knowledge Base of its lexical basis, of a word formation lexicon and of a lemmatized and syntactically annotated corpus.

Linked Open Treebanks. Interlinking Syntactically Annotated Corpora in the LiLa Knowledge Base of Linguistic Resources for Latin
Francesco Mambrini | Marco Passarotti
Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019)

Proceedings of the Second International Workshop on Resources and Tools for Derivational Morphology
Magda Ševčíková | Zdeněk Žabokrtský | Eleonora Litta | Marco Passarotti
Proceedings of the Second International Workshop on Resources and Tools for Derivational Morphology

The Treatment of Word Formation in the LiLa Knowledge Base of Linguistic Resources for Latin
Eleonora Litta | Marco Passarotti | Francesco Mambrini
Proceedings of the Second International Workshop on Resources and Tools for Derivational Morphology

2018

Challenges in Converting the Index Thomisticus Treebank into Universal Dependencies
Flavio Massimiliano Cecchini | Marco Passarotti | Paola Marongiu | Daniel Zeman
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)

This paper describes the changes applied to the original process used to convert the Index Thomisticus Treebank, a corpus including texts in Medieval Latin by Thomas Aquinas, into the annotation style of Universal Dependencies. The changes are made both to harmonise the Universal Dependencies version of the Index Thomisticus Treebank with the two other available Latin treebanks and to fix errors and inconsistencies resulting from the original process. The paper details the treatment of different issues in PoS tagging, lemmatisation and assignment of dependency relations. Finally, it assesses the quality of the new conversion process by providing an evaluation against a gold standard.

2017

The Lemlat 3.0 Package for Morphological Analysis of Latin
Marco Passarotti | Marco Budassi | Eleonora Litta | Paolo Ruffolo
Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language

The Treebanked Conspiracy. Actors and Actions in Bellum Catilinae
Marco Passarotti | Berta González Saavedra
Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories

2016

Differentia compositionem facit. A Slower-Paced and Reliable Parser for Latin
Edoardo Maria Ponti | Marco Passarotti
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The Index Thomisticus Treebank is the largest available treebank for Latin; it contains Medieval Latin texts by Thomas Aquinas. After experimenting on its data with a number of dependency parsers based on different supervised machine learning techniques, we found that DeSR with a multilayer perceptron algorithm, a right-to-left transition, and a tailor-made feature model is the parser providing the highest accuracy rates. We improved the results further by using a technique that combines the output parses of DeSR with those provided by other parsers, outperforming the previous state of the art in parsing the Index Thomisticus Treebank. The key idea behind this improvement is to ensure sufficient diversity and accuracy of the outputs to be combined; for this reason, we performed an in-depth evaluation of the results provided by the different parsers that we combined. Finally, we found that, although the general architecture of the parser is portable to Classical Latin, the model trained on Medieval Latin is inadequate for this purpose.

Latin Vallex. A Treebank-based Semantic Valency Lexicon for Latin
Marco Passarotti | Berta González Saavedra | Christophe Onambele
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Despite a centuries-long tradition in lexicography, Latin lacks state-of-the-art computational lexical resources. This situation is closely related to the still quite limited amount of linguistically annotated textual data for Latin, which can help the building of new lexical resources by supporting them with empirical evidence. However, projects for creating new language resources for Latin have been launched over the last decade to fill this gap. In this paper, we present Latin Vallex, a valency lexicon for Latin built in mutual connection with the semantic and pragmatic annotation of two Latin treebanks featuring texts of different eras. On the one hand, such a connection between the empirical evidence provided by the treebanks and the lexicon makes it possible to enhance each frame entry in the lexicon with its frequency in real data. On the other hand, each valency-capable word in the treebanks is linked to a frame entry in the lexicon.

Nomen Omen. Enhancing the Latin Morphological Analyser Lemlat with an Onomasticon
Marco Budassi | Marco Passarotti
Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

2014

From Syntax to Semantics. First Steps Towards Tectogrammatical Annotation of Latin
Marco Passarotti
Proceedings of the 8th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)

A Compact Interactive Visualization of Dependency Treebank Query Results
Chris Culy | Marco Passarotti | Ulla König-Cardanobile
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

One of the challenges of corpus querying is making sense of the results of a query, especially when a large number of results and linguistically annotated data are concerned. While the most widespread tools for querying syntactically annotated corpora tend to focus on single occurrences, one aspect that is not yet fully exploited in this area is that language is a complex system whose units are connected to each other at both the microscopic level (the single occurrences) and the macroscopic level (the whole system itself). Assuming that language is a system, we describe a tool (using the DoubleTreeJS visualization) to visualize the results of querying dependency treebanks by forming a node from a single item type, and building a network in which the heads and the dependents of the central node are respectively the left and the right vertices of the tree, which are connected to the central node by dependency relations. One case study is presented, in which DoubleTreeJS is used to support one assumption in theoretical linguistics with evidence provided by the data of a dependency treebank of Medieval Latin.

Thomas Aquinas in the TüNDRA: Integrating the Index Thomisticus Treebank into CLARIN-D
Scott Martens | Marco Passarotti
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper describes the integration of the Index Thomisticus Treebank (IT-TB) into the web-based treebank search and visualization application TueNDRA (Tuebingen aNnotated Data Retrieval & Analysis). TueNDRA was originally designed to provide access via the Internet to constituency treebanks and to tools for searching and visualizing them, as well as tabulating statistics about their contents. TueNDRA has now been extended to also provide full support for dependency treebanks with non-projective dependencies, in order to integrate the IT-TB and future treebanks with similar properties. These treebanks are queried using an adapted form of the TIGERSearch query language, which can search both hierarchical and sequential information in treebanks in a single query. Because TueNDRA is a web application, making the IT-TB accessible through it makes the treebank and the tools to make use of it available to a large community without having to distribute software and show users how to install it.

2013

Non-Projectivity in the Ancient Greek Dependency Treebank
Francesco Mambrini | Marco Passarotti
Proceedings of the Second International Conference on Dependency Linguistics (DepLing 2013)

2012

First Steps towards the Semi-automatic Development of a Wordformation-based Lexicon of Latin
Marco Passarotti | Francesco Mambrini
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Although lexicography of Latin has a long tradition dating back to ancient grammarians, and almost all Latin grammars devote to wordformation at least one part of the section(s) concerning morphology, none of the currently available lexical resources and NLP tools for Latin features a wordformation-based organization of the Latin lexicon. In this paper, we describe the first steps towards the semi-automatic development of a wordformation-based lexicon of Latin, by detailing several problems occurring while building the lexicon and presenting our solutions. Developing a wordformation-based lexicon of Latin is nowadays of utmost importance, as recent years have seen a large growth of annotated corpora of Latin texts of different eras. While these corpora include lemmatization, morphological tagging and syntactic analysis, none of them features segmentation of the word forms and wordformation relations between the lexemes. This restricts the browsing and the exploitation of the annotated data for linguistic research and NLP tasks, such as information retrieval and heuristics in PoS tagging of unknown words.

2010

Improvements in Parsing the Index Thomisticus Treebank. Revision, Combination and a Feature Model for Medieval Latin
Marco Passarotti | Felice Dell’Orletta
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The creation of language resources for less-resourced languages such as historical ones benefits from the exploitation of language-independent tools and methods developed over the years by many projects for modern languages. Along these lines, a number of treebanks for historical languages have recently begun to appear, including treebanks for Latin. Among the Latin treebanks, the Index Thomisticus Treebank is a 68,000-token dependency treebank based on the Index Thomisticus by Roberto Busa SJ, which contains the opera omnia of Thomas Aquinas (118 texts) as well as 61 texts by other authors related to Thomas, for a total of approximately 11 million tokens. In this paper, we describe a number of modifications that we applied to the dependency parser DeSR, in order to improve the parsing accuracy rates on the Index Thomisticus Treebank. First, we adapted the parser to the specific processing of Medieval Latin, defining an ad-hoc configuration of its features. Then, in order to improve the accuracy rates provided by DeSR, we applied a revision parsing method and we combined the outputs produced by different algorithms. This allowed us to improve accuracy rates substantially, reaching results that are well beyond the state of the art of parsing for Latin.

2009

The Development of the “Index Thomisticus” Treebank Valency Lexicon
Barbara McGillivray | Marco Passarotti
Proceedings of the EACL 2009 Workshop on Language Technology and Resources for Cultural Heritage, Social Sciences, Humanities, and Education (LaTeCH – SHELT&R 2009)

2008

The Annotation Guidelines of the Latin Dependency Treebank and Index Thomisticus Treebank: the Treatment of some specific Syntactic Constructions in Latin
David Bamman | Marco Passarotti | Roberto Busa | Gregory Crane
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The paper describes the treatment of some specific syntactic constructions in two treebanks of Latin according to a common set of annotation guidelines. Both projects work within the theoretical framework of Dependency Grammar, which has been demonstrated to be an especially appropriate framework for the representation of languages with a moderately free word order, where the linear order of constituents is broken up with elements of other constituents. The two projects are the first of their kind for Latin, so no prior established guidelines for syntactic annotation are available to rely on. The general model for the adopted style of representation is that used by the Prague Dependency Treebank, with departures arising from the Latin grammar of Pinkster, specifically in the traditional grammatical categories of the ablative absolute, the accusative + infinitive, and gerunds/gerundives. Sharing common annotation guidelines allows us to compare the datasets of the two treebanks for tasks such as mutually checking annotation consistency, diachronically studying specific syntactic constructions, and training statistical dependency parsers.