Steve Cassidy

2017

Overview of the 2017 ALTA Shared Task: Correcting OCR Errors
Diego Mollá-Aliod | Steve Cassidy
Proceedings of the Australasian Language Technology Association Workshop 2017

2016

Publishing the Trove Newspaper Corpus
Steve Cassidy
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The Trove Newspaper Corpus is derived from the National Library of Australia’s digital archive of newspaper text. The corpus is a snapshot of the NLA collection taken in 2015 to be made available for language research as part of the Alveo Virtual Laboratory and contains 143 million articles dating from 1806 to 2007. This paper describes the work we have done to make this large corpus available as a research collection, facilitating access to individual documents and enabling large scale processing of the newspaper text in a cloud-based environment.

2015

Finding Names in Trove: Named Entity Recognition for Australian Historical Newspapers
Sunghwan Mac Kim | Steve Cassidy
Proceedings of the Australasian Language Technology Association Workshop 2015

2014

Alveo, a Human Communication Science Virtual Laboratory
Dominique Estival | Steve Cassidy
Proceedings of the Australasian Language Technology Association Workshop 2014

Integrating UIMA with Alveo, a human communication science virtual laboratory
Dominique Estival | Steve Cassidy | Karin Verspoor | Andrew MacKinlay | Denis Burnham
Proceedings of the Workshop on Open Infrastructures and Analysis Frameworks for HLT

AusTalk: an audio-visual corpus of Australian English
Dominique Estival | Steve Cassidy | Felicity Cox | Denis Burnham
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper describes the AusTalk corpus, which was designed and created through the Big ASC, a collaborative project with the two main goals of providing a standardised infrastructure for audio-visual recordings in Australia and of producing a large audio-visual corpus of Australian English, with 3 hours of AV recordings for 1000 speakers. We first present the overall project, then describe the corpus itself and its components, the strict data collection protocol with high levels of standardisation and automation, and the processes put in place for quality control. We also discuss the annotation phase of the project, along with its goals and challenges; a major contribution of the project has been to explore procedures for automating annotations and we present our solutions. We conclude with the current status of the corpus and with some examples of research already conducted with this new resource. AusTalk is one of the corpora included in the HCS vLab, which is briefly sketched in the conclusion.

The Alveo Virtual Laboratory: A Web Based Repository API
Steve Cassidy | Dominique Estival | Timothy Jones | Denis Burnham | Jared Burghold
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The Human Communication Science Virtual Laboratory (HCS vLab) is an eResearch project funded under the Australian Government NeCTAR program to build a platform for collaborative eResearch around data representing human communication and the tools that researchers use in their analysis. The human communication science field is broadly defined to encompass the study of language from various perspectives but also includes research on music and various other forms of human expression. This paper outlines the core architecture of the HCS vLab and in particular, highlights the web based API that provides access to data and tools to authenticated users.

2013

Interoperable Annotation in the Australian National Corpus
Steve Cassidy
Proceedings of the 9th Joint ISO - ACL SIGSEM Workshop on Interoperable Semantic Annotation

2012

The Australian National Corpus: National Infrastructure for Language Resources
Steve Cassidy | Michael Haugh | Pam Peters | Mark Fallu
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The Australian National Corpus has been established in an effort to make currently scattered and relatively inaccessible data available to researchers through an online portal. In contrast to other national corpora, it is conceptualised as a linked collection of many existing and future language resources representing language use in Australia, unified through common technical standards. This approach allows us to bootstrap a significant collection and add value to existing resources by providing a unified, online tool-set to support research in a number of disciplines. This paper provides an outline of the technical platform being developed to support the corpus and a brief overview of some of the collections that form part of the initial version of the Australian National Corpus.
