Ben Hachey

2020

pdf abs
Cost-effective Selection of Pretraining Data: A Case Study of Pretraining BERT on Social Media
Xiang Dai | Sarvnaz Karimi | Ben Hachey | Cecile Paris
Findings of the Association for Computational Linguistics: EMNLP 2020

Recent studies on domain-specific BERT models show that effectiveness on downstream tasks can be improved when models are pretrained on in-domain data. Often, the pretraining data used in these models are selected based on their subject matter, e.g., biology or computer science. Given the range of applications using social media text, and its unique language variety, we pretrain two models on tweets and forum text respectively, and empirically demonstrate the effectiveness of these two resources. In addition, we investigate how similarity measures can be used to nominate in-domain pretraining data. We publicly release our pretrained models at https://bit.ly/35RpTf0.

pdf abs
An Effective Transition-based Model for Discontinuous NER
Xiang Dai | Sarvnaz Karimi | Ben Hachey | Cecile Paris
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Unlike widely used Named Entity Recognition (NER) data sets in generic domains, biomedical NER data sets often contain mentions consisting of discontinuous spans. Conventional sequence tagging techniques encode Markov assumptions that are efficient but preclude recovery of these mentions. We propose a simple, effective transition-based model with generic neural encoding for discontinuous NER. Through extensive experiments on three biomedical data sets, we show that our model can effectively recognize discontinuous mentions without sacrificing the accuracy on continuous mentions.

2019

Named entity recognition (NER) is widely used in natural language processing applications and downstream tasks. However, most NER tools target flat annotation from popular datasets, eschewing the semantic information available in nested entity mentions. We describe NNE—a fine-grained, nested named entity dataset over the full Wall Street Journal portion of the Penn Treebank (PTB). Our annotation comprises 279,795 mentions of 114 entity types with up to 6 layers of nesting. We hope the public release of this large dataset for English newswire will encourage development of new techniques for nested NER.

pdf abs
Using Similarity Measures to Select Pretraining Data for NER
Xiang Dai | Sarvnaz Karimi | Ben Hachey | Cecile Paris
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Word vectors and Language Models (LMs) pretrained on a large amount of unlabelled data can dramatically improve various Natural Language Processing (NLP) tasks. However, the measure and impact of similarity between pretraining data and target task data are left to intuition. We propose three cost-effective measures to quantify different aspects of similarity between source pretraining and target task data. We demonstrate that these measures are good predictors of the usefulness of pretrained models for Named Entity Recognition (NER) over 30 data pairs. Results also suggest that pretrained LMs are more effective and more predictable than pretrained word vectors, but pretrained word vectors are better when pretraining data is dissimilar.

2018

pdf abs
Can adult mental health be predicted by childhood future-self narratives? Insights from the CLPsych 2018 Shared Task
Kylie Radford | Louise Lavrencic | Ruth Peters | Kim Kiely | Ben Hachey | Scott Nowson | Will Radford
Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic

The CLPsych 2018 Shared Task B explores how childhood essays can predict psychological distress throughout the author’s life. Our main aim was to build tools to help our psychologists understand the data, propose features and interpret predictions. We submitted two linear regression models: ModelA uses simple demographic and word-count features, while ModelB uses linguistic, entity, typographic, expert-gazetteer, and readability features. Our models perform best at younger prediction ages, with our best unofficial score at 23 of 0.426 disattenuated Pearson correlation. This task is challenging and although predictive performance is limited, we propose that tight integration of expertise across computational linguistics and clinical psychology is a productive direction.

2017

pdf abs
Learning to generate one-sentence biographies from Wikidata
Andrew Chisholm | Will Radford | Ben Hachey
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

We investigate the generation of one-sentence Wikipedia biographies from facts derived from Wikidata slot-value pairs. We train a recurrent neural network sequence-to-sequence model with attention to select facts and generate textual summaries. Our model incorporates a novel secondary objective that helps ensure it generates sentences that contain the input facts. The model achieves a BLEU score of 41, improving significantly upon the vanilla sequence-to-sequence model and scoring roughly twice that of a simple template baseline. Human preference evaluation suggests the model is nearly as good as the Wikipedia reference. Manual analysis explores content selection, suggesting the model can trade the ability to infer knowledge against the risk of hallucinating incorrect information.

pdf abs
English Event Detection With Translated Language Features
Sam Wei | Igor Korostil | Joel Nothman | Ben Hachey
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

We propose novel radical features from automatic translation for event extraction. Event detection is a complex language processing task for which it is expensive to collect training data, making generalisation challenging. We derive meaningful subword features from automatic translations into target language. Results suggest this method is particularly useful when using languages with writing systems that facilitate easy decomposition into subword features, e.g., logograms and Cangjie. The best result combines logogram features from Chinese and Japanese with syllable features from Korean, providing an additional 3.0 points f-score when added to state-of-the-art generalisation features on the TAC KBP 2015 Event Nugget task.

2016

pdf
CLPsych 2016 Shared Task: Triaging content in online peer-support forums
David N. Milne | Glen Pink | Ben Hachey | Rafael A. Calvo
Proceedings of the Third Workshop on Computational Linguistics and Clinical Psychology

pdf
Classification of mental health forum posts
Glen Pink | Will Radford | Ben Hachey
Proceedings of the Third Workshop on Computational Linguistics and Clinical Psychology

pdf bib
Discovering Entity Knowledge Bases on the Web
Andrew Chisholm | Will Radford | Ben Hachey
Proceedings of the 5th Workshop on Automated Knowledge Base Construction

pdf
:telephone::person::sailboat::whale::okhand: ; or “Call me Ishmael” – How do you translate emoji?
Will Radford | Ben Hachey | Bo Han | Andy Chisholm
Proceedings of the Australasian Language Technology Association Workshop 2016

pdf
Presenting a New Dataset for the Timeline Generation Problem
Xavier Holt | Will Radford | Ben Hachey
Proceedings of the Australasian Language Technology Association Workshop 2016

pdf
Overview of the 2016 ALTA Shared Task: Cross-KB Coreference
Andrew Chisholm | Ben Hachey | Diego Mollá
Proceedings of the Australasian Language Technology Association Workshop 2016

2015

pdf bib
Proceedings of the Australasian Language Technology Association Workshop 2015
Ben Hachey | Kellie Webster
Proceedings of the Australasian Language Technology Association Workshop 2015

pdf
A comparison and analysis of models for event trigger detection
Sam Shang Chun Wei | Ben Hachey
Proceedings of the Australasian Language Technology Association Workshop 2015

pdf abs
Entity Disambiguation with Web Links
Andrew Chisholm | Ben Hachey
Transactions of the Association for Computational Linguistics, Volume 3

Entity disambiguation with Wikipedia relies on structured information from redirect pages, article text, inter-article links, and categories. We explore whether web links can replace a curated encyclopaedia, obtaining entity prior, name, context, and coherence models from a corpus of web pages with links to Wikipedia. Experiments compare web link models to Wikipedia models on well-known conll and tac data sets. Results show that using 34 million web links approaches Wikipedia performance. Combining web link and Wikipedia models produces the best-known disambiguation accuracy of 88.7 on standard newswire test data.

Co-authors

Venues

Ben Hachey

2020

2019

2018

2017

2016

2015

2014

2012

2010

2009

2006

2005

2004

2003

Co-authors

Venues