Allan Hanbury

2021

DreamDrug - A crowdsourced NER dataset for detecting drugs in darknet markets
Johannes Bogensperger | Sven Schlarb | Allan Hanbury | Gábor Recski
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)

We present DreamDrug, a crowdsourced dataset for detecting mentions of drugs in noisy user-generated item listings from darknet markets. Our dataset contains nearly 15,000 manually annotated drug entities in over 3,500 item listings scraped from the darknet market platform “DreamMarket” in 2017. We also train and evaluate baseline models for detecting these entities, using contextual language models fine-tuned in a few-shot setting and on the full dataset, and examine the effect of pretraining on in-domain unannotated corpora.

2020

Effective Crowd-Annotation of Participants, Interventions, and Outcomes in the Text of Clinical Trial Reports
Markus Zlabinger | Marta Sabou | Sebastian Hofstätter | Allan Hanbury
Findings of the Association for Computational Linguistics: EMNLP 2020

The search for Participants, Interventions, and Outcomes (PIO) in clinical trial reports is a critical task in Evidence Based Medicine. For automatic PIO extraction, high-quality corpora are needed. Obtaining such a corpus from crowdworkers, however, has been shown to be ineffective since (i) workers usually lack the domain-specific expertise to conduct the task with sufficient quality, and (ii) the standard approach of annotating entire abstracts of trial reports as one task-instance (i.e., HIT) leads to an uneven distribution in task effort. In this paper, we switch from entire-abstract to sentence annotation, referred to as the SenBase approach. We build upon SenBase in SenSupport, where we compensate for the crowdworkers' lack of domain-specific expertise by showing, for each task-instance, similar sentences that are already annotated by experts. Such tailored task-instance examples are retrieved via an unsupervised semantic short-text similarity (SSTS) method, and we evaluate nine methods to find an effective solution for SenSupport. We compute the Cohen’s Kappa agreement between crowd-annotations and gold standard annotations and show that (i) both sentence-based approaches outperform a Baseline approach where entire abstracts are annotated; (ii) supporting annotators with tailored task-instance examples is the best performing approach with Kappa agreements of 0.78/0.75/0.69 for P, I, and O respectively.
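
For readers unfamiliar with the agreement measure named in this abstract, the following is a minimal sketch of Cohen's Kappa computed between two label sequences. It is not the authors' evaluation code: the function name `cohen_kappa`, the per-token label scheme ("P", "I", "O"), and the toy data are assumptions made purely for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa between two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")

    n = len(labels_a)

    # Observed agreement: fraction of positions where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, the product of the two annotators'
    # marginal frequencies, summed over all labels.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    # Degenerate case: both annotators used one identical label throughout.
    if expected == 1.0:
        return 1.0

    return (observed - expected) / (1.0 - expected)


# Hypothetical toy data (not taken from the paper).
crowd = ["P", "P", "O", "I", "O", "O"]
gold = ["P", "O", "O", "I", "O", "I"]
kappa = cohen_kappa(crowd, gold)
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is preferred over plain accuracy when label distributions are skewed.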

2018

Medical Entity Corpus with PICO elements and Sentiment Analysis
Markus Zlabinger | Linda Andersson | Allan Hanbury | Michael Andersson | Vanessa Quasnik | Jon Brassey
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Volatility Prediction using Financial Disclosures Sentiments with Word Embedding-based IR Models
Navid Rekabsaz | Mihai Lupu | Artem Baklanov | Alexander Dür | Linda Andersson | Allan Hanbury
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Volatility prediction, an essential concept in financial markets, has recently been addressed using sentiment analysis methods. We investigate the sentiment of annual disclosures of companies in stock markets to forecast volatility. We specifically explore the use of recent Information Retrieval (IR) term weighting models that are effectively extended by related terms using word embeddings. In parallel to textual information, factual market data have been widely used as the mainstream approach to forecast market risk. We therefore study different fusion methods to combine text and market data resources. Our word embedding-based approach significantly outperforms state-of-the-art methods. In addition, we investigate the characteristics of the reports of the companies in different financial sectors.

2016

Standard Test Collection for English-Persian Cross-Lingual Word Sense Disambiguation
Navid Rekabsaz | Serwah Sabetghadam | Mihai Lupu | Linda Andersson | Allan Hanbury
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper, we address the shortage of evaluation benchmarks for the Persian (Farsi) language by creating and making available a new benchmark for English-to-Persian Cross-Lingual Word Sense Disambiguation (CL-WSD). In creating the benchmark, we follow the format of the SemEval-2013 CL-WSD task, such that the tools introduced for the task can also be applied to the benchmark. In fact, the new benchmark extends the SemEval-2013 CL-WSD task to the Persian language.