We present DreamDrug, a crowdsourced dataset for detecting mentions of drugs in noisy user-generated item listings from darknet markets. Our dataset contains nearly 15,000 manually annotated drug entities in over 3,500 item listings scraped from the darknet market platform “DreamMarket” in 2017. We also train and evaluate baseline models for detecting these entities, using contextual language models fine-tuned in a few-shot setting and on the full dataset, and examine the effect of pretraining on in-domain unannotated corpora.
The search for Participants, Interventions, and Outcomes (PIO) in clinical trial reports is a critical task in Evidence Based Medicine. For an automatic PIO extraction, high-quality corpora are needed. Obtaining such a corpus from crowdworkers, however, has been shown to be ineffective since (i) workers usually lack domain-specific expertise to conduct the task with sufficient quality, and (ii) the standard approach of annotating entire abstracts of trial reports as one task-instance (i.e. HIT) leads to an uneven distribution in task effort. In this paper, we switch from entire abstract to sentence annotation, referred to as the SenBase approach. We build upon SenBase in SenSupport, where we compensate the lack of domain-specific expertise of crowdworkers by showing for each task-instance similar sentences that are already annotated by experts. Such tailored task-instance examples are retrieved via unsupervised semantic short-text similarity (SSTS) method – and we evaluate nine methods to find an effective solution for SenSupport. We compute the Cohen’s Kappa agreement between crowd-annotations and gold standard annotations and show that (i) both sentence-based approaches outperform a Baseline approach where entire abstracts are annotated; (ii) supporting annotators with tailored task-instance examples is the best performing approach with Kappa agreements of 0.78/0.75/0.69 for P, I, and O respectively.
Volatility prediction—an essential concept in financial markets—has recently been addressed using sentiment analysis methods. We investigate the sentiment of annual disclosures of companies in stock markets to forecast volatility. We specifically explore the use of recent Information Retrieval (IR) term weighting models that are effectively extended by related terms using word embeddings. In parallel to textual information, factual market data have been widely used as the mainstream approach to forecast market risk. We therefore study different fusion methods to combine text and market data resources. Our word embedding-based approach significantly outperforms state-of-the-art methods. In addition, we investigate the characteristics of the reports of the companies in different financial sectors.
In this paper, we address the shortage of evaluation benchmarks on Persian (Farsi) language by creating and making available a new benchmark for English to Persian Cross Lingual Word Sense Disambiguation (CL-WSD). In creating the benchmark, we follow the format of the SemEval 2013 CL-WSD task, such that the introduced tools of the task can also be applied on the benchmark. In fact, the new benchmark extends the SemEval-2013 CL-WSD task to Persian language.