Ulrik Petersen

2006

Querying Both Parallel And Treebank Corpora: Evaluation Of A Corpus Query System
Ulrik Petersen
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

The last decade has seen a large increase in the number of available corpus query systems. Some of these are optimized for a particular kind of linguistic annotation (e.g., time-aligned, treebank, or word-oriented). In this paper, we report on our own corpus query system, called Emdros. Emdros is very generic, and can be applied to almost any kind of linguistic annotation using almost any linguistic theory. We describe Emdros and its query language, showing some of the benefits that linguists can derive from using Emdros for their corpora. We then describe the underlying database model of Emdros, and show how two corpora can be imported into the system. One of the two is a parallel corpus of Hungarian and English (the Hunglish corpus), while the other is a treebank of German (the TIGER Corpus). In order to evaluate the performance of Emdros, we then run some performance tests. It is shown that Emdros has extremely good performance on small corpora (less than 1 million words), and that it scales well to corpora of many millions of words.

2004

Emdros - a text database engine for analyzed or annotated text
Ulrik Petersen
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics
