Simon Corston-Oliver

Also published as: Simon H. Corston-Oliver

2021

Improving Punctuation Restoration for Speech Transcripts via External Data
Xue-Yong Fu | Cheng Chen | Md Tahmid Rahman Laskar | Shashi Bhushan | Simon Corston-Oliver
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)

Automatic Speech Recognition (ASR) systems generally do not produce punctuated transcripts. To make transcripts more readable and to match the expected input format of downstream language models, it is necessary to add punctuation marks. In this paper, we tackle the punctuation restoration problem specifically for noisy text (e.g., phone conversation scenarios). To leverage the available written text datasets, we introduce a data sampling technique based on an n-gram language model to sample additional training data that is similar to our in-domain data. Moreover, we propose a two-stage fine-tuning approach that utilizes the sampled external data as well as our in-domain dataset for models based on BERT. Extensive experiments show that the proposed approach outperforms the baseline with an improvement of 1.12% in F1 score.

2006

Multilingual Dependency Parsing using Bayes Point Machines
Simon Corston-Oliver | Anthony Aue | Kevin Duh | Eric Ringger
Proceedings of the Human Language Technology Conference of the NAACL, Main Conference

The impact of parse quality on syntactically-informed statistical machine translation
Chris Quirk | Simon Corston-Oliver
Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing

Dependency Parsing with Reference to Slovene, Spanish and Swedish
Simon Corston-Oliver | Anthony Aue
Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X)

2004

Linguistically Informed Statistical Models of Constituent Structure for Ordering in Sentence Realization
Eric Ringger | Michael Gamon | Robert C. Moore | David Rojas | Martine Smets | Simon Corston-Oliver
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

Task-Focused Summarization of Email
Simon Corston-Oliver | Eric Ringger | Michael Gamon | Richard Campbell
Text Summarization Branches Out

Normalizing German and English inflectional morphology to improve statistical word alignment
Simon Corston-Oliver | Michael Gamon
Proceedings of the 6th Conference of the Association for Machine Translation in the Americas: Technical Papers

German has a richer system of inflectional morphology than English, which causes problems for current approaches to statistical word alignment. Using Giza++ as a reference implementation of IBM Model 1, an HMM-based alignment model, and IBM Model 4, we measure the impact of normalizing inflectional morphology on German-English statistical word alignment. We demonstrate that normalizing inflectional morphology improves the perplexity of the models and reduces alignment errors.

2003

French Amalgam: a quick adaptation of a sentence realization system to French
Martine Smets | Michael Gamon | Simon Corston-Oliver | Eric Ringger
10th Conference of the European Chapter of the Association for Computational Linguistics

Combining decision trees and transformation-based learning to correct transferred linguistic representations
Simon Corston-Oliver | Michael Gamon
Proceedings of Machine Translation Summit IX: Papers

We present an approach to correcting features in transferred linguistic representations in machine translation. The hybrid approach combines decision trees and transformation-based learning. Decision trees serve as a filter on the intractably large search space of possible interrelations among features. Transformation-based learning results in a simple set of ordered rules that can be compiled and executed after transfer and before sentence realization in the target language. We measure the reduction in noise in the linguistic representations and report the results of human evaluations of end-to-end English-German machine translation.

French Amalgam: A machine-learned sentence realization system
Martine Smets | Michael Gamon | Simon Corston-Oliver | Eric Ringger
Actes de la 10ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

This paper presents the French implementation of Amalgam, a machine-learned sentence realization system. It presents in some detail two of the machine-learned models employed in Amalgam and shows how linguistic intuition and knowledge can be combined with statistical techniques to improve the performance of the models.

2002

Machine-learned contexts for linguistic operations in German sentence realization
Michael Gamon | Eric Ringger | Simon Corston-Oliver | Robert Moore
Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics

Extraposition: A Case Study in German Sentence Realization
Michael Gamon | Eric Ringger | Zhu Zhang | Robert Moore | Simon Corston-Oliver
COLING 2002: The 19th International Conference on Computational Linguistics

An Overview of Amalgam: A Machine-learned Generation Module
Simon Corston-Oliver | Michael Gamon | Eric Ringger | Robert Moore
Proceedings of the International Natural Language Generation Conference

2001

A Machine Learning Approach to the Automatic Evaluation of Machine Translation
Simon Corston-Oliver | Michael Gamon | Chris Brockett
Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics

Using machine learning for system-internal evaluation of transferred linguistic representations
Michael Gamon | Hisami Suzuki | Simon Corston-Oliver
Proceedings of Machine Translation Summit VIII

We present an automated, system-internal evaluation technique for linguistic representations in a large-scale, multilingual MT system. We use machine-learned classifiers to distinguish linguistic representations generated through transfer in an MT context from representations produced by "native" analysis of the target language. In the MT scenario, convergence of the two is the desired result. Holding the feature set and the learning algorithm constant, the accuracy of the classifiers provides a measure of the overall difference between the two sets of linguistic representations: classifiers with higher accuracy correspond to more pronounced differences between representations. More importantly, the classifiers provide a basis for error analysis by ranking the importance of linguistic features. The more salient a linguistic criterion is in discriminating transferred representations from "native" representations, the more work will be needed to get closer to the goal of producing native-like MT. We present results from using this approach on the Microsoft MT system and discuss its advantages and possible extensions.