Fabio Ciravegna

Also published as: F. Ciravegna


RP-DNN: A Tweet Level Propagation Context Based Deep Neural Networks for Early Rumor Detection in Social Media
Jie Gao | Sooji Han | Xingyi Song | Fabio Ciravegna
Proceedings of the Twelfth Language Resources and Evaluation Conference

Early rumor detection (ERD) on social media platforms is very challenging when limited, incomplete and noisy information is available. Most of the existing methods have largely worked on event-level detection that requires the collection of posts relevant to a specific event and relied only on user-generated content. They are not appropriate to detect rumor sources in the very early stages, before an event unfolds and becomes widespread. In this paper, we address the task of ERD at the message level. We present a novel hybrid neural network architecture, which combines a task-specific character-based bidirectional language model and stacked Long Short-Term Memory (LSTM) networks to represent textual contents and social-temporal contexts of input source tweets, for modelling propagation patterns of rumors in the early stages of their development. We apply multi-layered attention models to jointly learn attentive context embeddings over multiple context inputs. Our experiments employ a stringent leave-one-out cross-validation (LOO-CV) evaluation setup on seven publicly available real-life rumor event data sets. Our models achieve state-of-the-art (SoA) performance for detecting unseen rumors on large augmented data that covers more than 12 events and 2,967 rumors. An ablation study is conducted to understand the relative contribution of each component of our proposed model.
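
The leave-one-out setup mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each fold holds out all tweets of one event and trains on the remaining events; the abstract does not spell out the fold construction, so the function name `leave_one_event_out` and the `(event, text, label)` tuple layout are hypothetical.

```python
def leave_one_event_out(samples):
    """Yield (held_out_event, train, test) folds.

    `samples` is a list of (event, text, label) tuples. This tuple layout
    is an assumption made for illustration, not the paper's data format.
    """
    # One fold per distinct event, in a stable (sorted) order.
    events = sorted({event for event, _, _ in samples})
    for held_out in events:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test


if __name__ == "__main__":
    toy = [
        ("event_a", "tweet 1", 1),
        ("event_a", "tweet 2", 0),
        ("event_b", "tweet 3", 1),
        ("event_c", "tweet 4", 0),
    ]
    for event, train, test in leave_one_event_out(toy):
        print(event, len(train), len(test))
```

Holding out a whole event, rather than individual tweets, means the model is always scored on an event it has never seen, which matches the abstract's emphasis on unseen rumors.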


JATE 2.0: Java Automatic Term Extraction with Apache Solr
Ziqi Zhang | Jie Gao | Fabio Ciravegna
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Automatic Term Extraction (ATE) or Recognition (ATR) is a fundamental processing step preceding many complex knowledge engineering tasks. However, few methods have been implemented as public tools and in particular, available as open-source freeware. Further, little effort is made to develop an adaptable and scalable framework that enables customization, development, and comparison of algorithms under a uniform environment. This paper introduces JATE 2.0, a complete remake of the free Java Automatic Term Extraction Toolkit (Zhang et al., 2008) delivering new features including: (1) highly modular, adaptable and scalable ATE thanks to integration with Apache Solr, the open source free-text indexing and search platform; (2) an extended collection of state-of-the-art algorithms. We carry out experiments on two well-known benchmarking datasets and compare the algorithms along the dimensions of effectiveness (precision) and efficiency (speed and memory consumption). To the best of our knowledge, this is by far the only free ATE library offering a flexible architecture and the most comprehensive collection of algorithms.


Real-Time Detection, Tracking, and Monitoring of Automatically Discovered Events in Social Media
Miles Osborne | Sean Moran | Richard McCreadie | Alexander Von Lunen | Martin Sykora | Elizabeth Cano | Neil Ireson | Craig Macdonald | Iadh Ounis | Yulan He | Tom Jackson | Fabio Ciravegna | Ann O’Brien
Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations


Mining Equivalent Relations from Linked Data
Ziqi Zhang | Anna Lisa Gentile | Isabelle Augenstein | Eva Blomqvist | Fabio Ciravegna
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)


Automatically Extracting Procedural Knowledge from Instructional Texts using Natural Language Processing
Ziqi Zhang | Philip Webster | Victoria Uren | Andrea Varga | Fabio Ciravegna
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Procedural knowledge is the knowledge required to perform certain tasks, and forms an important part of expertise. A major source of procedural knowledge is natural language instructions. While these readable instructions have been useful learning resources for humans, they are not interpretable by machines. Automatically acquiring procedural knowledge in machine interpretable formats from instructions has become an increasingly popular research topic due to its potential applications in process automation. However, it has been insufficiently addressed. This paper presents an approach and an implemented system to assist users to automatically acquire procedural knowledge in structured forms from instructions. We introduce a generic semantic representation of procedures for analysing instructions, using which natural language techniques are applied to automatically extract structured procedures from instructions. The method is evaluated in three domains to justify the generality of the proposed semantic representation as well as the effectiveness of the implemented automatic system.

Unsupervised document zone identification using probabilistic graphical models
Andrea Varga | Daniel Preoţiuc-Pietro | Fabio Ciravegna
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Document zone identification aims to automatically classify sequences of text-spans (e.g. sentences) within a document into predefined zone categories. Current approaches to document zone identification mostly rely on supervised machine learning methods, which require a large amount of annotated data, which is often difficult and expensive to obtain. In order to overcome this bottleneck, we propose graphical models based on the popular Latent Dirichlet Allocation (LDA) model. The first model, which we call zoneLDA, aims to cluster the sentences into zone classes using only unlabelled data. We also study an extension of zoneLDA called zoneLDAb, which makes a distinction between common words and non-common words within the different zone types. We present results on two different domains: the scientific domain and the technical domain. For the latter we propose a new document zone classification schema, which has been annotated over a collection of 689 documents, achieving a Kappa score of 85%. Overall our experiments show promising results for both of the domains, outperforming the baseline model. Furthermore, on the technical domain the performance of the models is comparable to the supervised approach using the same feature sets. We thus believe that graphical models are a promising avenue of research for automatic document zoning.


Harnessing different knowledge sources to measure semantic relatedness under a uniform model
Ziqi Zhang | Anna Lisa Gentile | Fabio Ciravegna
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing


Improving Domain-specific Entity Recognition with Automatic Term Recognition and Feature Extraction
Ziqi Zhang | José Iria | Fabio Ciravegna
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Domain-specific entity recognition often relies on domain-specific knowledge to improve system performance. However, such knowledge often suffers from limited domain portability and is expensive to build and maintain. Therefore, obtaining it in a generic and unsupervised manner would be a desirable feature for domain-specific entity recognition systems. In this paper, we introduce an approach that exploits domain-specificity of words as a form of domain-knowledge for entity-recognition tasks. Compared to prior work in the field, our approach is generic and completely unsupervised. We empirically show an improvement in entity extraction accuracy when features derived by our unsupervised method are used, with respect to baseline methods that do not employ domain knowledge. We also compared the results against those of existing systems that use manually crafted domain knowledge, and found them to be competitive.


Using Similarity Metrics For Terminology Recognition
Jonathan Butters | Fabio Ciravegna
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper we present an approach to terminology recognition whereby a sublanguage term (e.g. an aircraft engine component term extracted from a maintenance log) is matched to its corresponding term from a pre-defined list (such as a taxonomy representing the official break-down of the engine). Terminology recognition is addressed as a classification task whereby the extracted term is associated to one or more potential terms in the official description list via the application of string similarity metrics. The solution described in the paper uses dynamically computed similarity cut-off thresholds calculated on the basis of modeling a noise curve. Dissimilar string matches form a Gaussian distributed noise curve that can be identified and extracted leaving only mostly similar string matches. Dynamically calculated thresholds are preferable over fixed similarity thresholds as fixed thresholds are inherently imprecise, that is, there is no similarity boundary beyond which any two strings always describe the same concept.
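
A dynamically computed cut-off of the kind described above can be sketched as follows. This is a minimal illustration, assuming the noise curve is summarised by the mean and standard deviation of all similarity scores for one extracted term and that the cut-off sits `k` standard deviations above the mean; the paper's actual noise-modelling procedure, the value of `k`, and the function names are assumptions.

```python
import statistics


def dynamic_threshold(scores, k=2.0):
    """Cut-off placed k standard deviations above the mean score.

    Assumes most scores come from dissimilar (noise) matches, so the
    mean and spread of all scores roughly describe the noise curve.
    """
    mu = statistics.fmean(scores)
    sigma = statistics.pstdev(scores)
    return mu + k * sigma


def select_matches(scores_by_candidate, k=2.0):
    """Keep only candidates whose similarity exceeds the dynamic cut-off."""
    cutoff = dynamic_threshold(list(scores_by_candidate.values()), k)
    return {c: s for c, s in scores_by_candidate.items() if s > cutoff}


if __name__ == "__main__":
    # Hypothetical similarity scores between one extracted term and
    # each entry of an official component list.
    scores = {
        "fuel pump": 0.95, "oil filter": 0.20, "fan blade": 0.22,
        "gearbox": 0.25, "igniter": 0.21, "starter": 0.24,
        "bleed valve": 0.23, "nozzle": 0.26, "spinner": 0.19,
        "stator vane": 0.22,
    }
    print(select_matches(scores))
```

Because the cut-off is recomputed for every extracted term, it adapts to how noisy that term's candidate scores are, which is the stated advantage over a single fixed threshold. Note that in this simplified version genuine matches also inflate the estimated spread; a more careful fit would estimate the noise parameters from the bulk of the distribution only.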

A Comparative Evaluation of Term Recognition Algorithms
Ziqi Zhang | Jose Iria | Christopher Brewster | Fabio Ciravegna
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Automatic Term Recognition (ATR) is a fundamental processing step preceding more complex tasks such as semantic search and ontology learning. From a large number of methodologies available in the literature only a few are able to handle both single and multi-word terms. In this paper we present a comparison of five such algorithms and propose a combined approach using a voting mechanism. We evaluated the six approaches using two different corpora and show how the voting algorithm performs best on one corpus (a collection of texts from Wikipedia) and less well using the Genia corpus (a standard life science corpus). This indicates that choice and design of corpus has a major impact on the evaluation of term recognition algorithms. Our experiments also showed that single-word terms can be equally important and occupy a fairly large proportion in certain domains. As a result, algorithms that ignore single-word terms may cause problems to tasks built on top of ATR. Effective ATR systems also need to take into account both the unstructured text and the structured aspects and this means information extraction techniques need to be integrated into the term recognition process.
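
One way to combine several ranked term lists by voting is sketched below. The abstract does not specify the voting scheme, so this example assumes a simple mean-rank vote in which a candidate missing from a list receives that list's length as a penalty rank; the function name `vote` and the toy rankings are hypothetical.

```python
def vote(rankings):
    """Merge ranked candidate lists by mean rank position (lower is better).

    A candidate absent from a ranking is given that ranking's length as
    its position, so it ranks below every candidate the list contains.
    """
    candidates = set().union(*rankings)
    mean_rank = {}
    for cand in candidates:
        total = 0
        for ranking in rankings:
            total += ranking.index(cand) if cand in ranking else len(ranking)
        mean_rank[cand] = total / len(rankings)
    # Ties are broken alphabetically to keep the output deterministic.
    return sorted(candidates, key=lambda c: (mean_rank[c], c))


if __name__ == "__main__":
    # Hypothetical outputs of three term recognition algorithms.
    rankings = [
        ["cell line", "t cell", "protein"],
        ["t cell", "cell line", "gene"],
        ["cell line", "protein", "t cell"],
    ]
    print(vote(rankings))
```

Rank-based voting avoids having to normalise the raw scores of the individual algorithms, which are typically on different scales.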

Saxon: an Extensible Multimedia Annotator
Mark Greenwood | José Iria | Fabio Ciravegna
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper introduces Saxon, a rule-based document annotator that is capable of processing and annotating several document formats and media, both within and across documents. Furthermore, Saxon is readily extensible to support other input formats due to both its flexible rule formalism and the modular plugin architecture of the Runes framework upon which it is built. In this paper we introduce the Saxon rule formalism through examples aimed at highlighting its power and flexibility.


An Incremental Tri-Partite Approach To Ontology Learning
José Iria | Christopher Brewster | Fabio Ciravegna | Yorick Wilks
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this paper we present a new approach to ontology learning. Its basis lies in a dynamic and iterative view of knowledge acquisition for ontologies. The Abraxas approach is founded on three resources: a set of texts, a set of learning patterns and a set of ontological triples, each of which must remain in equilibrium. As events occur which disturb this equilibrium, various actions are triggered to re-establish a balance between the resources. Such events include acquisition of a further text from external resources such as the Web or the addition of ontological triples to the ontology. We develop the concept of a knowledge gap between the coverage of an ontology and the corpus of texts as a measure triggering actions. We present an overview of the algorithm and its functionalities.

A Methodology and Tool for Representing Language Resources for Information Extraction
José Iria | Fabio Ciravegna
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In recent years there has been a growing interest in clarifying the process of Information Extraction (IE) from documents, particularly when coupled with Machine Learning. We believe that a fundamental step forward in clarifying the IE process would be to be able to perform comparative evaluations on the use of different representations. However, this is difficult because most of the time the way information is represented is too tightly coupled with the algorithm at an implementation level, making it impossible to vary representation while keeping the algorithm constant. A further motivation behind our work is to reduce the complexity of designing, developing and testing IE systems. The major contribution of this work is in defining a methodology and providing a software infrastructure for representing language resources independently of the algorithm, mainly for Information Extraction but with application in other fields - we are currently evaluating its use for ontology learning and document classification.

An Experimental Study on Boundary Classification Algorithms for Information Extraction using SVM
Jose Iria | Neil Ireson | Fabio Ciravegna
Proceedings of the Workshop on Adaptive Text Extraction and Mining (ATEM 2006)


A Critical Survey of the Methodology for IE Evaluation
A. Lavelli | M. E. Califf | F. Ciravegna | D. Freitag | C. Giuliano | N. Kushmerick | L. Romano
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

We survey the evaluation methodology adopted in Information Extraction (IE), as defined in the MUC conferences and in later independent efforts applying machine learning to IE. We point out a number of problematic issues that may hamper the comparison between results obtained by different researchers. Some of them are common to other NLP tasks: e.g., the difficulty of exactly identifying the effects on performance of the data (sample selection and sample size), of the domain theory (features selected), and of algorithm parameter settings. Issues specific to IE evaluation include: how leniently to assess inexact identification of filler boundaries, the possibility of multiple fillers for a slot, and how the counting is performed. We argue that, when specifying an information extraction task, a number of characteristics should be clearly defined. However, in the papers only a few of them are usually explicitly specified. Our aim is to elaborate a clear and detailed experimental methodology and propose it to the IE community. The goal is to reach a widespread agreement on such a proposal so that future IE evaluations will adopt the proposed methodology, making comparisons between algorithms fair and reliable. In order to achieve this goal, we will develop and make available to the community a set of tools and resources that incorporate a standardized IE methodology.


Mining Web Sites Using Unsupervised Adaptive Information Extraction
Alexiei Dingli | Fabio Ciravegna | David Guthrie | Yorick Wilks
10th Conference of the European Chapter of the Association for Computational Linguistics


Using HLT for Acquiring, Retrieving and Publishing Knowledge in AKT
Kalina Bontcheva | Christopher Brewster | Fabio Ciravegna | Hamish Cunningham | Louise Guthrie | Robert Gaizauskas | Yorick Wilks
Proceedings of the ACL 2001 Workshop on Human Language Technology and Knowledge Management


Grammar Organization for Cascade-based Parsing in Information Extraction
Fabio Ciravegna | Alberto Lavelli
Proceedings of the Sixth International Workshop on Parsing Technologies


Full Text Parsing using Cascades of Rules: an Information Extraction Perspective
Fabio Ciravegna | Alberto Lavelli
Ninth Conference of the European Chapter of the Association for Computational Linguistics


Participatory Design for Linguistic Engineering: the Case of the GEPPETTO Development Environment
Fabio Ciravegna | Alberto Lavelli | Daniela Petrelli | Fabio Pianesi
Computational Environments for Grammar Development and Linguistic Engineering

Controlling Bottom-Up Chart Parsers through Text Chunking
Fabio Ciravegna | Alberto Lavelli
Proceedings of the Fifth International Workshop on Parsing Technologies

In this paper we propose to use text chunking for controlling a bottom-up parser. As is well known, during analysis such parsers produce many constituents that do not contribute to the final solution(s). Most of these constituents are introduced because of the parser's inability to check the input context around them. Preliminary text chunking allows the parser to focus directly on the constituents that seem more likely and to prune the search space when some satisfactory solutions are found. Preliminary experiments show that a CYK-like parser controlled through chunking is definitely more efficient than a traditional parser without a significant loss in correctness. Moreover, the quality of possible partial results produced by the controlled parser is high. The strategy is particularly suited for tasks like Information Extraction from text (IE), where sentences are often long and complex and it is very difficult to have complete coverage. Hence, there is a strong necessity of focusing on the most likely solutions; furthermore, in IE the quality of partial results is important.


On Parsing Control for Efficient Text Analysis
Fabio Ciravegna | Alberto Lavelli
Proceedings of the Fourth International Workshop on Parsing Technologies


Knowledge Extraction From Texts by Sintesi
Fabio Ciravegna | Paolo Campia | Alberto Colognese
COLING 1992 Volume 4: The 14th International Conference on Computational Linguistics