2022
SNuC: The Sheffield Numbers Spoken Language Corpus
Emma Barker | Jon Barker | Robert Gaizauskas | Ning Ma | Monica Lestari Paramita
Proceedings of the Thirteenth Language Resources and Evaluation Conference
We present SNuC, the first published corpus of spoken alphanumeric identifiers of the sort typically used as serial and part numbers in the manufacturing sector. The dataset contains recordings and transcriptions of over 50 native British English speakers, speaking over 13,000 multi-character alphanumeric sequences and totalling almost 20 hours of recorded speech. We describe requirements taken into account in designing the corpus and the methodology used to construct it. We present summary statistics describing the corpus contents, as well as a preliminary investigation into errors in spoken alphanumeric identifiers. We validate the corpus by showing how it can be used to adapt a deep learning neural network based ASR system, resulting in improved recognition accuracy on the task of spoken alphanumeric identifier recognition. Finally, we discuss further potential uses for the corpus and for the tools developed to construct it.
2016
Summarizing Multi-Party Argumentative Conversations in Reader Comment on News
Emma Barker | Robert Gaizauskas
Proceedings of the Third Workshop on Argument Mining (ArgMining2016)
The SENSEI Annotated Corpus: Human Summaries of Reader Comment Conversations in On-line News
Emma Barker | Monica Lestari Paramita | Ahmet Aker | Emina Kurtic | Mark Hepple | Robert Gaizauskas
Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue
Automatic label generation for news comment clusters
Ahmet Aker | Monica Paramita | Emina Kurtic | Adam Funk | Emma Barker | Mark Hepple | Rob Gaizauskas
Proceedings of the 9th International Natural Language Generation conference
What’s the Issue Here?: Task-based Evaluation of Reader Comment Summarization Systems
Emma Barker | Monica Paramita | Adam Funk | Emina Kurtic | Ahmet Aker | Jonathan Foster | Mark Hepple | Robert Gaizauskas
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Automatic summarization of reader comments in on-line news is an extremely challenging task and a capability for which there is a clear need. Work to date has focussed on producing extractive summaries using well-known techniques imported from other areas of language processing. But are extractive summaries of comments what users really want? Do they support users in performing the sorts of tasks they are likely to want to perform with reader comments? In this paper we address these questions by doing three things. First, we offer a specification of one possible summary type for reader comment, based on an analysis of reader comment in terms of issues and viewpoints. Second, we define a task-based evaluation framework for reader comment summarization that allows summarization systems to be assessed in terms of how well they support users in a time-limited task of identifying issues and characterising opinion on issues in comments. Third, we describe a pilot evaluation in which we used the task-based evaluation framework to evaluate a prototype reader comment clustering and summarization system, demonstrating the viability of the evaluation framework and illustrating the sorts of insight such an evaluation affords.
2014
Bootstrapping Term Extractors for Multiple Languages
Ahmet Aker | Monica Paramita | Emma Barker | Robert Gaizauskas
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Terminology extraction resources are needed for a wide range of human language technology applications, including knowledge management, information extraction, semantic search, cross-language information retrieval and automatic and assisted translation. We present a low-cost method for creating terminology extraction resources for 21 non-English EU languages. Using parallel corpora and a projection method, we create a General POS Tagger for these languages. We also investigate the use of EuroVoc terms and a Wikipedia corpus to automatically create a term grammar for each language. Our results show that these automatically generated resources can assist the term extraction process with similar performance to manually generated resources. All resources resulting from this experiment are freely available for download.
Assigning Terms to Domains by Document Classification
Robert Gaizauskas | Emma Barker | Monica Lestari Paramita | Ahmet Aker
Proceedings of the 4th International Workshop on Computational Terminology (Computerm)
2012
Assessing the Comparability of News Texts
Emma Barker | Robert Gaizauskas
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Comparable news texts are frequently proposed as a potential source of alignable subsentential fragments for use in statistical machine translation systems. But can we assess just how potentially useful they will be? In this paper we first discuss a scheme for classifying news text pairs according to the degree of relatedness of the events they report and investigate how robust this classification scheme is via a multi-lingual annotation exercise. We then propose an annotation methodology, similar to that used in summarization evaluation, to allow us to identify and quantify shared content at the subsentential level in news text pairs and report a preliminary exercise to assess this method. We conclude by discussing how this work fits into a broader programme of assessing the potential utility of comparable news texts for extracting paraphrases/translational equivalents for use in language processing applications.
2006
Simulating Cub Reporter Dialogues: The collection of naturalistic human-human dialogues for information access to text archives
Emma Barker | Ryuichiro Higashinaka | François Mairesse | Robert Gaizauskas | Marilyn Walker | Jonathan Foster
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
This paper describes a dialogue data collection experiment and resulting corpus for dialogues between a senior mobile journalist and a junior cub reporter back at the office. The purpose of the dialogue is for the mobile journalist to collect background information in preparation for an interview or on-the-site coverage of a breaking story. The cub reporter has access to text archives that contain such background information. A unique aspect of these dialogues is that they capture information-seeking behavior for an open-ended task against a large unstructured data source. Initial analyses of the corpus show that the experimental design leads to real-time, mixed-initiative, highly interactive dialogues with many interesting properties.