Jonathon Read


2016

In this paper we present the Corpus of REcommendation STrength (CREST), a collection of HTML-formatted clinical guidelines annotated with the location of recommendations. Each recommendation is labelled with an author-provided indicator of its strength. Because the data were drawn from many disparate authors, we define a unified scheme of strength labels and provide a mapping for each guideline. We demonstrate the utility of the corpus and its annotations in two ways: initial measurements of the types of language constructions associated with strong and weak recommendations, and experiments on promising features for recommendation classification, both with respect to strong and weak labels and to all labels of the unified scheme. An error analysis indicates that, while there is a strong relationship between lexical choices and strength labels, the choices made by different authors can vary substantially.

2014

2013

2012

We present the WeSearch Data Collection (WDC), a freely redistributable, partly annotated, comprehensive sample of User-Generated Content. The WDC contains data extracted from a range of genres of varying formality (user forums, product review sites, blogs and Wikipedia) and covers two different domains (NLP and Linux). In this article, we describe the data selection and extraction process, with a focus on the extraction of linguistic content from different sources. We describe the format of the syntacto-semantic annotations found in this resource and report initial parsing results for these data, as well as some reflections following a first round of treebanking.

2010

2007

2005