Francesco Cutugno


2014

pdf
VOLIP: a corpus of spoken Italian and a virtuous example of reuse of linguistic resources
Iolanda Alfano | Francesco Cutugno | Aurelio De Rosa | Claudio Iacobini | Renata Savy | Miriam Voghera
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The corpus VoLIP (The Voice of LIP) is an Italian speech resource which associates the audio signals to the orthographic transcriptions of the LIP Corpus. The LIP Corpus was designed to represent diaphasic, diatopic and diamesic variation. The Corpus was collected in the early ‘90s to compile a frequency lexicon of spoken Italian and its size was tailored to produce a reliable frequency lexicon for the first 3,000 lemmas. Therefore, it consists of about 500,000 word tokens for 60 hours of recording. The speech materials belong to five different text registers and they were collected in four different cities. Thanks to a modern technological approach VoLIP web service allows users to search the LIP corpus using IMDI metadata, lexical or morpho-syntactic entry keys, receiving as result the audio portions aligned to the corresponding required entry. The VoLIP corpus is freely available at the URL http://www.parlaritaliano.it.

2012

pdf
W-PhAMT: A web tool for phonetic multilevel timeline visualization
Francesco Cutugno | Vincenza Anna Leano | Antonio Origlia
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper presents a web platform with an its own graphic environment to visualize and filter multilevel phonetic annotations. The tool accepts as input Annotation Graph XML and Praat TextGrids files and converts these files into a specific XML format. XML output is used to browse data by means of a web tool using a visualization metaphor, namely a timeline. A timeline is a graphical representation of a period of time, on which relevant events are marked. Events are usually distributed over many layers in a geometrical metaphor represented by segments and points spatially distributed with reference to a temporal axis. The tool shows all the annotations included in the uploaded dataset, allowing the listening of the entire file or of its parts. Filtering is allowed on annotation labels by means of string pattern matching. The web service includes cloud services to share data with other users. The tool is available at http://w-phamt.fisica.unina.it

2010

pdf
New Features in Spoken Language Search Hawk (SpLaSH): Query Language and Query Sequence
Sara Romano | Francesco Cutugno
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this work we present further development of the SpLaSH (Spoken Language Search Hawk) project. SpLaSH implements a data model for annotated speech corpora integrated with textual markup (i.e. POS tagging, syntax, pragmatics) including a toolkit used to perform complex queries across speech and text labels. The integration of time aligned annotations (TMA), represented making use of Annotation Graphs, with text aligned ones (TXA), stored in generic XML files, are provided by a data structure, the Connector Frame, acting as table-look-up linking temporal data to words in the text. SpLaSH imposes a very limited number of constraints to the data model design, allowing the integration of annotations developed separately within the same dataset and without any relative dependency. It also provides a GUI allowing three types of queries: simple query on TXA or TMA structures, sequence query on TMA structure and cross query on both TXA and TMA integrated structures. In this work new SpLaSH features will be presented: SpLaSH Query Language (SpLaSHQL) and Query Sequence.

2006

pdf
Multilevel corpus analysis: generating and querying an AGset of spoken Italian (SpIt-MDb).
Renata Savy | Francesco Cutugno | Claudia Crocco
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this paper we present an application of AGTK to a corpus of spoken Italian annotated at many different linguistic levels. The work consists of two parts: a) the presentation of AG-SpIt, a toolkit devoted to corpus data management that we developed according to AGTK proposals; b) the presentation of corpus’ structure together with some examples and results of cross-level linguistic analyses obtained querying the database (SpIt-MDb). As this work is still an ongoing investigation, results must be considered preliminary, as a ‘demo’ illustrating the potentiality of the tool and the advantages it introduces to validate linguistic theories and annotation systems. Currently, SpIt-MDb is a linguistic resource under development; it represents one of the first attempts to create an Italian corpus labelled at various linguistic levels (from acoustic/sub-phonetic, to textual/pragmatic ones) which can be queried in the interrelations among levels.

pdf
An observatory on Spoken Italian linguistic resources and descriptive standards.
Miriam Voghera | Francesco Cutugno
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

We present the national project “Parlare italiano: osservatorio degli usi linguistici”, funded by the Italian Ministry of Education, Scientific Research and University (PRIN 2004). Ten research groups participate to the project from various Italian universities. The project has four fundamental objectives: 1) to plan a national website that collects the most recent theoretical and applied results on spoken language; 2) to create an observatory of the linguistic usages of the Italian spoken language; 3) to delineate and implement standard and formalized methods and procedures for the study of spoken language; 4) to develop a training program for young researchers. The website will be accessible starting from November 2006.