Querying Diverse Treebanks in a Uniform Way

Jan Štěpánek, Petr Pajas


Abstract
This paper presents a system for querying treebanks in a uniform way. The system is able to work with both dependency and constituency based treebanks in any language. We demonstrate its abilities on 11 different treebanks. The query language used by the system provides many features not available in other existing systems while still keeping the performance efficient. The paper also describes the conversion of ten treebanks into a common XML-based format used by the system, touching the question of standards and formats. The paper then shows several examples of linguistically interesting questions that the system is able to answer, for example browsing verbal clauses without subjects or extraposed relative clauses, generating the underlying grammar in a constituency treebank, searching for non-projective edges in a dependency treebank, or word-order typology of a language based on the treebank. The performance of several implementations of the system is also discussed by measuring the time requirements of some of the queries.
Anthology ID:
L10-1260
Volume:
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Month:
May
Year:
2010
Address:
Valletta, Malta
Venue:
LREC
SIG:
Publisher:
European Language Resources Association (ELRA)
Note:
Pages:
Language:
URL:
http://www.lrec-conf.org/proceedings/lrec2010/pdf/381_Paper.pdf
DOI:
Bibkey:
Cite (ACL):
Jan Štěpánek and Petr Pajas. 2010. Querying Diverse Treebanks in a Uniform Way. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta. European Language Resources Association (ELRA).
Cite (Informal):
Querying Diverse Treebanks in a Uniform Way (Štěpánek & Pajas, LREC 2010)
Copy Citation:
PDF:
http://www.lrec-conf.org/proceedings/lrec2010/pdf/381_Paper.pdf