Ben Hachey


2020

Cost-effective Selection of Pretraining Data: A Case Study of Pretraining BERT on Social Media
Xiang Dai | Sarvnaz Karimi | Ben Hachey | Cecile Paris
Findings of the Association for Computational Linguistics: EMNLP 2020

Recent studies on domain-specific BERT models show that effectiveness on downstream tasks can be improved when models are pretrained on in-domain data. Often, the pretraining data used in these models are selected based on their subject matter, e.g., biology or computer science. Given the range of applications using social media text, and its unique language variety, we pretrain two models on tweets and forum text respectively, and empirically demonstrate the effectiveness of these two resources. In addition, we investigate how similarity measures can be used to nominate in-domain pretraining data. We publicly release our pretrained models at https://bit.ly/35RpTf0.

An Effective Transition-based Model for Discontinuous NER
Xiang Dai | Sarvnaz Karimi | Ben Hachey | Cecile Paris
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Unlike widely used Named Entity Recognition (NER) data sets in generic domains, biomedical NER data sets often contain mentions consisting of discontinuous spans. Conventional sequence tagging techniques encode Markov assumptions that are efficient but preclude recovery of these mentions. We propose a simple, effective transition-based model with generic neural encoding for discontinuous NER. Through extensive experiments on three biomedical data sets, we show that our model can effectively recognize discontinuous mentions without sacrificing the accuracy on continuous mentions.

2019

NNE: A Dataset for Nested Named Entity Recognition in English Newswire
Nicky Ringland | Xiang Dai | Ben Hachey | Sarvnaz Karimi | Cecile Paris | James R. Curran
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Named entity recognition (NER) is widely used in natural language processing applications and downstream tasks. However, most NER tools target flat annotation from popular datasets, eschewing the semantic information available in nested entity mentions. We describe NNE—a fine-grained, nested named entity dataset over the full Wall Street Journal portion of the Penn Treebank (PTB). Our annotation comprises 279,795 mentions of 114 entity types with up to 6 layers of nesting. We hope the public release of this large dataset for English newswire will encourage development of new techniques for nested NER.

Using Similarity Measures to Select Pretraining Data for NER
Xiang Dai | Sarvnaz Karimi | Ben Hachey | Cecile Paris
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Word vectors and Language Models (LMs) pretrained on a large amount of unlabelled data can dramatically improve various Natural Language Processing (NLP) tasks. However, the measure and impact of similarity between pretraining data and target task data are left to intuition. We propose three cost-effective measures to quantify different aspects of similarity between source pretraining and target task data. We demonstrate that these measures are good predictors of the usefulness of pretrained models for Named Entity Recognition (NER) over 30 data pairs. Results also suggest that pretrained LMs are more effective and more predictable than pretrained word vectors, but pretrained word vectors are better when pretraining data is dissimilar.

2018

Can adult mental health be predicted by childhood future-self narratives? Insights from the CLPsych 2018 Shared Task
Kylie Radford | Louise Lavrencic | Ruth Peters | Kim Kiely | Ben Hachey | Scott Nowson | Will Radford
Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic

The CLPsych 2018 Shared Task B explores how childhood essays can predict psychological distress throughout the author’s life. Our main aim was to build tools to help our psychologists understand the data, propose features and interpret predictions. We submitted two linear regression models: ModelA uses simple demographic and word-count features, while ModelB uses linguistic, entity, typographic, expert-gazetteer, and readability features. Our models perform best at younger prediction ages, with our best unofficial score at age 23: a disattenuated Pearson correlation of 0.426. This task is challenging, and although predictive performance is limited, we propose that tight integration of expertise across computational linguistics and clinical psychology is a productive direction.

2017

Learning to generate one-sentence biographies from Wikidata
Andrew Chisholm | Will Radford | Ben Hachey
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

We investigate the generation of one-sentence Wikipedia biographies from facts derived from Wikidata slot-value pairs. We train a recurrent neural network sequence-to-sequence model with attention to select facts and generate textual summaries. Our model incorporates a novel secondary objective that helps ensure it generates sentences that contain the input facts. The model achieves a BLEU score of 41, improving significantly upon the vanilla sequence-to-sequence model and scoring roughly twice that of a simple template baseline. Human preference evaluation suggests the model is nearly as good as the Wikipedia reference. Manual analysis explores content selection, suggesting the model can trade the ability to infer knowledge against the risk of hallucinating incorrect information.

English Event Detection With Translated Language Features
Sam Wei | Igor Korostil | Joel Nothman | Ben Hachey
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

We propose novel radical features from automatic translation for event extraction. Event detection is a complex language processing task for which it is expensive to collect training data, making generalisation challenging. We derive meaningful subword features from automatic translations into a target language. Results suggest this method is particularly useful when using languages with writing systems that facilitate easy decomposition into subword features, e.g., logograms and Cangjie. The best result combines logogram features from Chinese and Japanese with syllable features from Korean, providing an additional 3.0 points F-score when added to state-of-the-art generalisation features on the TAC KBP 2015 Event Nugget task.

2016

CLPsych 2016 Shared Task: Triaging content in online peer-support forums
David N. Milne | Glen Pink | Ben Hachey | Rafael A. Calvo
Proceedings of the Third Workshop on Computational Linguistics and Clinical Psychology

Classification of mental health forum posts
Glen Pink | Will Radford | Ben Hachey
Proceedings of the Third Workshop on Computational Linguistics and Clinical Psychology

Discovering Entity Knowledge Bases on the Web
Andrew Chisholm | Will Radford | Ben Hachey
Proceedings of the 5th Workshop on Automated Knowledge Base Construction

:telephone::person::sailboat::whale::okhand: ; or “Call me Ishmael” – How do you translate emoji?
Will Radford | Ben Hachey | Bo Han | Andy Chisholm
Proceedings of the Australasian Language Technology Association Workshop 2016

Presenting a New Dataset for the Timeline Generation Problem
Xavier Holt | Will Radford | Ben Hachey
Proceedings of the Australasian Language Technology Association Workshop 2016

Overview of the 2016 ALTA Shared Task: Cross-KB Coreference
Andrew Chisholm | Ben Hachey | Diego Mollá
Proceedings of the Australasian Language Technology Association Workshop 2016

2015

Proceedings of the Australasian Language Technology Association Workshop 2015
Ben Hachey | Kellie Webster
Proceedings of the Australasian Language Technology Association Workshop 2015

A comparison and analysis of models for event trigger detection
Sam Shang Chun Wei | Ben Hachey
Proceedings of the Australasian Language Technology Association Workshop 2015

Entity Disambiguation with Web Links
Andrew Chisholm | Ben Hachey
Transactions of the Association for Computational Linguistics, Volume 3

Entity disambiguation with Wikipedia relies on structured information from redirect pages, article text, inter-article links, and categories. We explore whether web links can replace a curated encyclopaedia, obtaining entity prior, name, context, and coherence models from a corpus of web pages with links to Wikipedia. Experiments compare web link models to Wikipedia models on well-known CoNLL and TAC data sets. Results show that using 34 million web links approaches Wikipedia performance. Combining web link and Wikipedia models produces the best-known disambiguation accuracy of 88.7 on standard newswire test data.

2014

Cheap and easy entity evaluation
Ben Hachey | Joel Nothman | Will Radford
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2012

Event Linking: Grounding Event Reference in a News Archive
Joel Nothman | Matthew Honnibal | Ben Hachey | James R. Curran
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2010

Proceedings of the NAACL HLT 2010 Workshop on Computational Linguistics in a World of Social Media
Ben Hachey | Miles Osborne
Proceedings of the NAACL HLT 2010 Workshop on Computational Linguistics in a World of Social Media

Tracking Information Flow between Primary and Secondary News Sources
Will Radford | Ben Hachey | James Curran | Maria Milosavljevic
Proceedings of the NAACL HLT 2010 Workshop on Computational Linguistics in a World of Social Media

2009

Evaluation of Generic Relation Identification
Ben Hachey
Proceedings of the Australasian Language Technology Association Workshop 2009

Tracking Information Flow in Financial Text
Will Radford | Ben Hachey | James R. Curran | Maria Milosavljevic
Proceedings of the Australasian Language Technology Association Workshop 2009

Multi-Document Summarisation Using Generic Relation Extraction
Ben Hachey
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

2006

Dimensionality Reduction Aids Term Co-Occurrence Based Multi-Document Summarization
Ben Hachey | Gabriel Murray | David Reitter
Proceedings of the Workshop on Task-Focused Summarization and Question Answering

Comparison of Similarity Models for the Relation Discovery Task
Ben Hachey
Proceedings of the Workshop on Linguistic Distances

2005

Investigating the Effects of Selective Sampling on the Annotation Task
Ben Hachey | Beatrice Alex | Markus Becker
Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005)

2004

A Rhetorical Status Classifier for Legal Text Summarisation
Ben Hachey | Claire Grover
Text Summarization Branches Out

The HOLJ Corpus. Supporting Summarisation of Legal Texts
Claire Grover | Ben Hachey | Ian Hughson
Proceedings of the 5th International Workshop on Linguistically Interpreted Corpora

2003

Summarising Legal Texts: Sentential Tense and Argumentative Roles
Claire Grover | Ben Hachey | Chris Korycinski
Proceedings of the HLT-NAACL 03 Text Summarization Workshop

Demonstration of the CROSSMARC System
Vangelis Karkaletsis | Constantine D. Spyropoulos | Dimitris Souflis | Claire Grover | Ben Hachey | Maria Teresa Pazienza | Michele Vindigni | Emmanuel Cartier | Jose Coch
Companion Volume of the Proceedings of HLT-NAACL 2003 - Demonstrations