José Iria

Also published as: Jose Iria


Improving Domain-specific Entity Recognition with Automatic Term Recognition and Feature Extraction
Ziqi Zhang | José Iria | Fabio Ciravegna
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Domain-specific entity recognition often relies on domain-specific knowledge to improve system performance. However, such knowledge often suffers from limited domain portability and is expensive to build and maintain. Obtaining it in a generic and unsupervised manner would therefore be a desirable feature for domain-specific entity recognition systems. In this paper, we introduce an approach that exploits the domain-specificity of words as a form of domain knowledge for entity recognition tasks. Compared to prior work in the field, our approach is generic and completely unsupervised. We empirically show an improvement in entity extraction accuracy when features derived by our unsupervised method are used, with respect to baseline methods that do not employ domain knowledge. We also compared the results against those of existing systems that use manually crafted domain knowledge, and found them to be competitive.
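The core idea of scoring how domain-specific a word is can be sketched as a simple relative-frequency ratio between a domain corpus and a general reference corpus; this is a minimal, hypothetical illustration (the corpora and the smoothing choice are assumptions, not the paper's exact method):

```python
from collections import Counter

def domain_specificity(domain_tokens, general_tokens):
    """Score each word by its relative frequency in a domain corpus
    versus a general reference corpus. Higher scores suggest the word
    is domain-specific; the scores can then be fed to an entity
    recogniser as an unsupervised feature."""
    d, g = Counter(domain_tokens), Counter(general_tokens)
    nd, ng = sum(d.values()), sum(g.values())
    # add-one smoothing so words absent from the general corpus
    # do not cause division by zero
    return {w: (d[w] / nd) / ((g[w] + 1) / (ng + 1)) for w in d}

# toy corpora for illustration only
domain = "protein binds protein at the binding site".split()
general = "the cat sat at the mat and the dog sat".split()
scores = domain_specificity(domain, general)
```

Under this scheme a content word frequent in the domain corpus (e.g. "protein") scores well above function words such as "the", which occur everywhere.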

A Random Graph Walk based Approach to Computing Semantic Relatedness Using Knowledge from Wikipedia
Ziqi Zhang | Anna Lisa Gentile | Lei Xia | José Iria | Sam Chapman
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Determining semantic relatedness between words or concepts is a fundamental process in many Natural Language Processing applications. Approaches to this task typically make use of knowledge resources such as WordNet and Wikipedia. However, these approaches use only a limited number of features extracted from these resources, without investigating the usefulness of combining different features or their relative importance for the task. In this paper, we propose a random walk model based approach to measuring semantic relatedness between words or concepts, which seamlessly integrates various features extracted from Wikipedia. We empirically study the usefulness of these features in the task, and show that by combining multiple features weighted according to their importance, our system obtains competitive results and outperforms other systems on some datasets.
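The walk-based combination described above can be sketched as follows: build one adjacency matrix per feature type, merge them with per-feature weights into a transition matrix, and compare the distributions reached by short random walks from each node. This is a hypothetical sketch (the toy graphs, weights, and cosine comparison are assumptions), not the authors' implementation:

```python
import numpy as np

def transition_matrix(feature_graphs, weights):
    """Combine several feature-specific adjacency matrices into one
    row-stochastic transition matrix using per-feature weights."""
    combined = sum(w * g for w, g in zip(weights, feature_graphs))
    row_sums = combined.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0  # leave isolated nodes in place
    return combined / row_sums

def walk_relatedness(P, i, j, steps=3):
    """Relatedness of nodes i and j as the cosine similarity of the
    probability distributions reached by a k-step random walk."""
    di = np.zeros(P.shape[0]); di[i] = 1.0
    dj = np.zeros(P.shape[0]); dj[j] = 1.0
    for _ in range(steps):
        di, dj = di @ P, dj @ P
    denom = np.linalg.norm(di) * np.linalg.norm(dj)
    return float(di @ dj / denom) if denom else 0.0

# toy example: two feature graphs over four concepts,
# e.g. one from Wikipedia links and one from shared categories
links = np.array([[0,1,1,0],[1,0,1,0],[1,1,0,1],[0,0,1,0]], float)
cats  = np.array([[0,1,0,0],[1,0,0,0],[0,0,0,1],[0,0,1,0]], float)
P = transition_matrix([links, cats], weights=[0.7, 0.3])
score = walk_relatedness(P, 0, 1)
```

The per-feature weights are the natural place to encode feature importance; learning them from data rather than fixing them by hand is the step the paper studies empirically.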


Too Many Mammals: Improving the Diversity of Automatically Recognized Terms
Ziqi Zhang | Lei Xia | Mark A. Greenwood | José Iria
Proceedings of the International Conference RANLP-2009

A Novel Approach to Automatic Gazetteer Generation using Wikipedia
Ziqi Zhang | José Iria
Proceedings of the 2009 Workshop on The People’s Web Meets NLP: Collaboratively Constructed Semantic Resources (People’s Web)


A Comparative Evaluation of Term Recognition Algorithms
Ziqi Zhang | Jose Iria | Christopher Brewster | Fabio Ciravegna
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Automatic Term Recognition (ATR) is a fundamental processing step preceding more complex tasks such as semantic search and ontology learning. Of the large number of methodologies available in the literature, only a few are able to handle both single- and multi-word terms. In this paper we present a comparison of five such algorithms and propose a combined approach using a voting mechanism. We evaluated the six approaches using two different corpora and show how the voting algorithm performs best on one corpus (a collection of texts from Wikipedia) and less well on the Genia corpus (a standard life science corpus). This indicates that the choice and design of the corpus have a major impact on the evaluation of term recognition algorithms. Our experiments also showed that single-word terms can be equally important and account for a fairly large proportion of terms in certain domains. As a result, algorithms that ignore single-word terms may cause problems for tasks built on top of ATR. Effective ATR systems also need to take into account both the unstructured text and the structured aspects, which means information extraction techniques need to be integrated into the term recognition process.
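A voting mechanism over the output of several term recognisers can be illustrated with a simple average-rank (Borda-style) scheme; the term lists below are made-up examples and the exact voting scheme used in the paper may differ:

```python
def vote_rankings(rankings):
    """Combine ranked candidate-term lists from several ATR algorithms
    by average rank: terms ranked highly by many algorithms rise to
    the top of the combined list."""
    terms = set().union(*rankings)

    def avg_rank(term):
        # a term missing from a list is penalised with rank len(list)
        return sum(r.index(term) if term in r else len(r)
                   for r in rankings) / len(rankings)

    return sorted(terms, key=avg_rank)

# toy ranked outputs of three hypothetical term recognisers
a = ["cell cycle", "protein", "binding site", "gene"]
b = ["protein", "cell cycle", "gene", "promoter"]
c = ["protein", "binding site", "cell cycle"]
combined = vote_rankings([a, b, c])
```

Here "protein" tops the combined list because every recogniser ranks it at or near the top, while a term proposed by only one algorithm sinks toward the bottom.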

An Approach to Modeling Heterogeneous Resources for Information Extraction
Lei Xia | José Iria
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper, we describe an approach to modeling heterogeneous resources for information extraction. Documents are modeled as graphs, a representation that enables a better understanding of a multimedia document and its structure, and could ultimately lead to better cross-media information extraction. We also describe a proposed algorithm that segments documents based on the document modeling approach described in this paper.

Saxon: an Extensible Multimedia Annotator
Mark Greenwood | José Iria | Fabio Ciravegna
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper introduces Saxon, a rule-based document annotator that is capable of processing and annotating several document formats and media, both within and across documents. Furthermore, Saxon is readily extensible to support other input formats due to both its flexible rule formalism and the modular plugin architecture of the Runes framework upon which it is built. In this paper we introduce the Saxon rule formalism through examples aimed at highlighting its power and flexibility.


WIT: Web People Search Disambiguation using Random Walks
José Iria | Lei Xia | Ziqi Zhang
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)


An Incremental Tri-Partite Approach To Ontology Learning
José Iria | Christopher Brewster | Fabio Ciravegna | Yorick Wilks
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this paper we present a new approach to ontology learning. Its basis lies in a dynamic and iterative view of knowledge acquisition for ontologies. The Abraxas approach is founded on three resources, a set of texts, a set of learning patterns and a set of ontological triples, each of which must remain in equilibrium. As events occur which disturb this equilibrium, various actions are triggered to re-establish a balance between the resources. Such events include the acquisition of a further text from external resources such as the Web, or the addition of ontological triples to the ontology. We develop the concept of a knowledge gap between the coverage of an ontology and the corpus of texts as a measure triggering actions. We present an overview of the algorithm and its functionalities.

A Methodology and Tool for Representing Language Resources for Information Extraction
José Iria | Fabio Ciravegna
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In recent years there has been a growing interest in clarifying the process of Information Extraction (IE) from documents, particularly when coupled with Machine Learning. We believe that a fundamental step forward in clarifying the IE process would be the ability to perform comparative evaluations on the use of different representations. However, this is difficult because most of the time the way information is represented is too tightly coupled with the algorithm at the implementation level, making it impossible to vary the representation while keeping the algorithm constant. A further motivation behind our work is to reduce the complexity of designing, developing and testing IE systems. The major contribution of this work is in defining a methodology and providing a software infrastructure for representing language resources independently of the algorithm, mainly for Information Extraction but with application in other fields - we are currently evaluating its use for ontology learning and document classification.

An Experimental Study on Boundary Classification Algorithms for Information Extraction using SVM
Jose Iria | Neil Ireson | Fabio Ciravegna
Proceedings of the Workshop on Adaptive Text Extraction and Mining (ATEM 2006)