Zuhaitz Beloki


2016

Interoperability of Annotation Schemes: Using the Pepper Framework to Display AWA Documents in the ANNIS Interface
Talvany Carlotto | Zuhaitz Beloki | Xabier Artola | Aitor Soroa
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Natural language processing applications are frequently integrated to solve complex linguistic problems, but the lack of interoperability between these tools tends to be one of the main obstacles in that process. The problem is often caused by the different linguistic formats used across the applications, which has led to attempts both to establish standard formats for representing linguistic information and to create conversion tools that facilitate this integration. Pepper is an example of the latter: a framework that supports conversion between different linguistic annotation formats. In this paper, we describe the use of Pepper to convert a corpus linguistically annotated with the AWA annotation scheme into the relANNIS format, with the ultimate goal of interacting with AWA documents through the ANNIS interface. The experiment converted 40 megabytes of AWA documents, allowed their use in the ANNIS interface, and involved making architectural decisions during the mapping from AWA into relANNIS using Pepper. The main difficulties encountered during this process were technical, caused mainly by the integration of the different systems and projects, namely AWA, Pepper and ANNIS.

Two Architectures for Parallel Processing of Huge Amounts of Text
Mathijs Kattenberg | Zuhaitz Beloki | Aitor Soroa | Xabier Artola | Antske Fokkens | Paul Huygen | Kees Verstoep
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents two alternative NLP architectures to analyze massive amounts of documents, using parallel processing. The two architectures focus on different processing scenarios, namely batch-processing and streaming processing. The batch-processing scenario aims at optimizing the overall throughput of the system, i.e., minimizing the overall time spent on processing all documents. The streaming architecture aims to minimize the time to process real-time incoming documents and is therefore especially suitable for live feeds. The paper presents experiments with both architectures, and reports the overall gain when they are used for batch as well as for streaming processing. All the software described in the paper is publicly available under free licenses.

2014

A stream computing approach towards scalable NLP
Xabier Artola | Zuhaitz Beloki | Aitor Soroa
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Computational power needs have grown dramatically in recent years. This is also the case in many language processing tasks, due to the overwhelming quantities of textual information that must be processed in a reasonable time frame. This scenario has led to a paradigm shift in the computing architectures and large-scale data processing strategies used in the NLP field. In this paper we describe a series of experiments carried out in the context of the NewsReader project with the goal of analyzing the scaling capabilities of the language processing pipeline used in it. We explore the use of Storm in a new approach for scalable distributed language processing across multiple machines and evaluate its effectiveness and efficiency when processing documents on a medium and large scale. The experiments have shown that there is considerable room for improvement in language processing performance when adopting parallel architectures, and that we might expect even better results with the use of large clusters with many processing nodes.