Efficiently Extract Recurring Tree Fragments from Large Treebanks

Federico Sangati, Willem Zuidema, Rens Bod


Abstract
In this paper we describe FragmentSeeker, a tool that identifies all tree constructions that recur multiple times in a large phrase-structure treebank. The tool is based on an efficient kernel-based dynamic programming algorithm, which compares every pair of trees in a given treebank and computes the list of fragments they share. We describe the two notions of fragment we use, standard and partial fragments, and provide implementation details on how to extract them from a syntactically annotated corpus. We have tested our system on the Penn Wall Street Journal treebank, for which we present quantitative and qualitative analyses of the recurring structures obtained, together with empirical running times. Finally, we propose ways in which our tool could contribute to research fields related to corpus analysis and processing, such as parsing, corpus statistics, annotation guidance, and automatic detection of argument structure.
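The pairwise comparison described in the abstract can be illustrated with a small sketch. This is not the FragmentSeeker implementation. It is a minimal Python illustration, assuming trees are nested tuples of the form (label, child, ...) with words as plain strings, and assuming that a shared fragment grows downward wherever two aligned nodes have the same production. It covers only the standard notion of fragment (a node keeps all of its children or none), and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def production(node):
    """Return (label, children signature); words stay strings, nonterminals become 1-tuples."""
    return (node[0], tuple(c if isinstance(c, str) else (c[0],) for c in node[1:]))


def internal_nodes(tree, path=()):
    """Yield (path, node) for every nonterminal node, parents before children."""
    if isinstance(tree, str):
        return
    yield path, tree
    for i, child in enumerate(tree[1:]):
        yield from internal_nodes(child, path + (i,))


def shared_fragment(a, b, pa, pb, covered):
    """Largest fragment rooted at both a and b (their productions must match)."""
    covered.add((pa, pb))  # remember this pair so it is not reported again on its own
    children = []
    for i, (ca, cb) in enumerate(zip(a[1:], b[1:])):
        if isinstance(ca, str):
            children.append(ca)  # identical word, guaranteed by the matching production
        elif production(ca) == production(cb):
            children.append(shared_fragment(ca, cb, pa + (i,), pb + (i,), covered))
        else:
            children.append((ca[0],))  # frontier nonterminal: label only, no children
    return (a[0], *children)


def shared_fragments(t1, t2):
    """Maximal shared fragments of two trees, one per uncovered matching node pair."""
    covered, result = set(), []
    nodes2 = list(internal_nodes(t2))
    for pa, a in internal_nodes(t1):
        for pb, b in nodes2:
            if (pa, pb) not in covered and production(a) == production(b):
                result.append(shared_fragment(a, b, pa, pb, covered))
    return result


def recurring_fragments(treebank):
    """Map each fragment to the number of tree pairs that share it (not its corpus frequency)."""
    counts = Counter()
    for t1, t2 in combinations(treebank, 2):
        counts.update(set(shared_fragments(t1, t2)))
    return counts


t1 = ("S", ("NP", ("DT", "the"), ("NN", "dog")), ("VP", ("VBD", "barked")))
t2 = ("S", ("NP", ("DT", "the"), ("NN", "cat")), ("VP", ("VBD", "slept")))
print(shared_fragments(t1, t2))
```

For the two toy trees, the only maximal shared fragment keeps the word "the" and leaves NN and VBD as frontier nonterminals. Comparing every pair of trees is quadratic in the size of the treebank, which is why the abstract stresses efficiency. The sketch makes no attempt to reproduce the optimizations or the partial-fragment extraction of the actual tool.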
Anthology ID:
L10-1420
Volume:
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Month:
May
Year:
2010
Address:
Valletta, Malta
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
URL:
http://www.lrec-conf.org/proceedings/lrec2010/pdf/613_Paper.pdf
Cite (ACL):
Federico Sangati, Willem Zuidema, and Rens Bod. 2010. Efficiently Extract Recurring Tree Fragments from Large Treebanks. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta. European Language Resources Association (ELRA).
Cite (Informal):
Efficiently Extract Recurring Tree Fragments from Large Treebanks (Sangati et al., LREC 2010)