The WeSearch Corpus, Treebank, and Treecache – A Comprehensive Sample of User-Generated Content

Jonathon Read, Dan Flickinger, Rebecca Dridan, Stephan Oepen, Lilja Øvrelid


Abstract
We present the WeSearch Data Collection (WDC), a freely redistributable, partly annotated, comprehensive sample of User-Generated Content. The WDC contains data extracted from a range of genres of varying formality (user forums, product review sites, blogs and Wikipedia) and covers two different domains (NLP and Linux). In this article, we describe the data selection and extraction process, with a focus on the extraction of linguistic content from different sources. We present the format of syntacto-semantic annotations found in this resource and report initial parsing results for these data, as well as some reflections following a first round of treebanking.
Anthology ID:
L12-1454
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
1829–1835
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/774_Paper.pdf
Cite (ACL):
Jonathon Read, Dan Flickinger, Rebecca Dridan, Stephan Oepen, and Lilja Øvrelid. 2012. The WeSearch Corpus, Treebank, and Treecache – A Comprehensive Sample of User-Generated Content. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 1829–1835, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
The WeSearch Corpus, Treebank, and Treecache – A Comprehensive Sample of User-Generated Content (Read et al., LREC 2012)