Alessandro Panunzi

2018

pdf
One event, many representations. Mapping action concepts through visual features.
Alessandro Panunzi | Lorenzo Gregori | Andrea Amelio Ravelli
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf abs
Measuring the Italian-English lexical gap for action verbs and its impact on translation
Lorenzo Gregori | Alessandro Panunzi
Proceedings of the 1st Workshop on Sense, Concept and Entity Representations and their Applications

This paper describes a method to measure the lexical gap of action verbs in Italian and English by using the IMAGACT ontology of action. The fine-grained categorization of action concepts of the data source allowed to have wide overview of the relation between concepts in the two languages. The calculated lexical gap for both English and Italian is about 30% of the action concepts, much higher than previous results. Beyond this general numbers a deeper analysis has been performed in order to evaluate the impact that lexical gaps can have on translation. In particular a distinction has been made between the cases in which the presence of a lexical gap affects translation correctness and completeness at a semantic level. The results highlight a high percentage of concepts that can be considered hard to translate (about 18% from English to Italian and 20% from Italian to English) and confirms that action verbs are a critical lexical class for translation tasks.

2014

pdf abs
The IMAGACT Visual Ontology. An Extendable Multilingual Infrastructure for the representation of lexical encoding of Action
Massimo Moneglia | Susan Brown | Francesca Frontini | Gloria Gagliardi | Fahad Khan | Monica Monachini | Alessandro Panunzi
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Action verbs have many meanings, covering actions in different ontological types. Moreover, each language categorizes action in its own way. One verb can refer to many different actions and one action can be identified by more than one verb. The range of variations within and across languages is largely unknown, causing trouble for natural language processing tasks. IMAGACT is a corpus-based ontology of action concepts, derived from English and Italian spontaneous speech corpora, which makes use of the universal language of images to identify the different action types extended by verbs referring to action in English, Italian, Chinese and Spanish. This paper presents the infrastructure and the various linguistic information the user can derive from it. IMAGACT makes explicit the variation of meaning of action verbs within one language and allows comparisons of verb variations within and across languages. Because the action concepts are represented with videos, extension into new languages beyond those presently implemented in IMAGACT is done using competence-based judgments by mother-tongue informants without intense lexicographic work involving underdetermined semantic description

2012

pdf
Verb interpretation for basic action types: annotation, ontology induction and creation of prototypical scenes
Francesca Frontini | Irene De Felice | Fahad Khan | Irene Russo | Monica Monachini | Gloria Gagliardi | Alessandro Panunzi
Proceedings of the 3rd Workshop on Cognitive Aspects of the Lexicon

pdf abs
The IMAGACT Cross-linguistic Ontology of Action. A new infrastructure for natural language disambiguation
Massimo Moneglia | Monica Monachini | Omar Calabrese | Alessandro Panunzi | Francesca Frontini | Gloria Gagliardi | Irene Russo
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Action verbs, which are highly frequent in speech, cause disambiguation problems that are relevant to Language Technologies. This is a consequence of the peculiar way each natural language categorizes Action i.e. it is a consequence of semantic factors. Action verbs are frequently general, since they extend productively to actions belonging to different ontological types. Moreover, each language categorizes action in its own way and therefore the cross-linguistic reference to everyday activities is puzzling. This paper briefly sketches the IMAGACT project, which aims at setting up a cross-linguistic Ontology of Action for grounding disambiguation tasks in this crucial area of the lexicon. The project derives information on the actual variation of action verbs in English and Italian from spontaneous speech corpora, where references to action are high in frequency. Crucially it makes use of the universal language of images to identify action types, avoiding the underdeterminacy of semantic definitions. Action concept entries are implemented as prototypic scenes; this will make it easier to extend the Ontology to other languages.

pdf abs
RIDIRE-CPI: an Open Source Crawling and Processing Infrastructure for Supervised Web-Corpora Building
Alessandro Panunzi | Marco Fabbri | Massimo Moneglia | Lorenzo Gregori | Samuele Paladini
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper introduces the RIDIRE-CPI, an open source tool for the building of web corpora with a specific design through a targeted crawling strategy. The tool has been developed within the RIDIRE Project, which aims at creating a 2 billion word balanced web corpus for Italian. RIDIRE-CPI architecture integrates existing open source tools as well as modules developed specifically within the RIDIRE project. It consists of various components: a robust crawler (Heritrix), a user friendly web interface, several conversion and cleaning tools, an anti-duplicate filter, a language guesser, and a PoS tagger. The RIDIRE-CPI user-friendly interface is specifically intended for allowing collaborative work performance by users with low skills in web technology and text processing. Moreover, RIDIRE-CPI integrates a validation interface dedicated to the evaluation of the targeted crawling. Through the content selection, metadata assignment, and validation procedures, the RIDIRE-CPI allows the gathering of linguistic data with a supervised strategy that leads to a higher level of control of the corpus contents. The modular architecture of the infrastructure and its open-source distribution will assure the reusability of the tool for other corpus building initiatives.

pdf abs
The annotation of the C-ORAL-BRASIL oral through the implementation of the Palavras Parser
Eckhard Bick | Heliana Mello | Alessandro Panunzi | Tommaso Raso
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This article describes the morphosyntactic annotation of the C-ORAL-BRASIL speech corpus, using an adapted version of the Palavras parser. In order to achieve compatibility with annotation rules designed for standard written Portuguese, transcribed words were orthographically normalized, and the parsing lexicon augmented with speech-specific material, phonetically spelled abbreviations etc. Using a two-level annotation approach, speech flow markers like overlaps, retractions and non-verbal productions were separated from running, annotatable text. In the absence of punctuation, syntactic segmentation was achieved by exploiting prosodic break markers, enhanced by a rule-based distinctions between pause and break functions. Under optimal conditions, the modified parsing system achieved correctness rates (F-scores) of 98.6% for part of speech, 95% for syntactic function and 99% for lemmatization. Especially at the syntactic level, a clear connection between accessibility of prosodic break markers and annotation performance could be documented.

2008

pdf abs
Integration of a Multilingual Keyword Extractor in a Document Management System
Andrea Agili | Marco Fabbri | Alessandro Panunzi | Manuel Zini
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper we present a new Document Management System called DrStorage. This DMS is multi-platform, JCR-170 compliant, supports WebDav, versioning, user authentication and authorization and the most widespread file formats (Adobe PDF, Microsoft Office, HTML,...). It is also easy to customize in order to enhance its search capabilities and to support automatic metadata assignment. DrStorage has been integrated with an automatic language guesser and with an automatic keyword extractor: these metadata can be assigned automatically to documents, because the DrStorages server part has benn modified to allow that metadata assignment takes place as documents are put in the repository. Metadata can greatly improve the search capabilites and the results quality of a search engine. DrStorages client has been customized with two search results view: the first, called timeline view, shows temporal trends of queries as an histogram, the second, keyword cloud, shows which words are correlated and how much are correlated with the results of a particular day.

2006

pdf abs
Integrating Methods and LRs for Automatic Keyword Extraction from Open Domain Texts
Alessandro Panunzi | Marco Fabbri | Massimo Moneglia
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

The paper presents a tool for keyword extraction from multilingual resources developed within the AXMEDIS project. In this tool lexical collocations (Sinclair, 1991) internal to documents are used to enhance the performance obtained through standard statistical procedure. A first set of mono-term keywords is extracted through the TF.IDF algorithm (Salton, 1989). The internal analysis of the document generates a second set of multi-term keywords based on the first set, rather than on multi-term frequency comparison with a general resource (Witten et al. 1999). Collocations in which a mono-term keyword occurs as the head are considered as multi-term keywords, and are assumed to increase the identification of the content. The evaluation compares the results of the TF.IDF procedure and the ones obtained with the enhanced procedure in terms of precision. Each set of keywords received a value from the point of view of a possible user, regarding: (a) overall efficiency of the whole set of keywords for the identification of the content; (b) adequacy of each extracted keyword. Results show that multi-term keywords increase the content identification with a 100% relative factor and that the adequacy is enhanced in 33% of cases.