2010
A Very Large Scale Mandarin Chinese Broadcast Corpus for GALE Project
Yi Liu | Pascale Fung | Yongsheng Yang | Denise DiPersio | Meghan Glenn | Stephanie Strassel | Christopher Cieri
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
In this paper, we present the design, collection, transcription and analysis of a Mandarin Chinese broadcast collection of over 3000 hours. The data was collected by the Hong Kong University of Science and Technology (HKUST) in China on a cable TV and satellite transmission platform established in support of the DARPA Global Autonomous Language Exploitation (GALE) program. The collection includes broadcast news (BN) and broadcast conversation (BC), including talk shows, roundtable discussions, call-in shows, editorials and other conversational programs that focus on news and current events. HKUST also collects detailed information about all recorded programs. A subset of BC and BN recordings is manually transcribed with standard Chinese characters in UTF-8 encoding, using specific mark-ups for a small set of spontaneous and conversational speech phenomena. The collection is among the largest and the first of its kind for Mandarin Chinese broadcast speech, providing abundant and diverse samples for Mandarin speech recognition and other application-dependent tasks, such as spontaneous speech processing and recognition, topic detection, information retrieval, and speaker recognition. HKUST's acoustic analysis of 500 hours of the speech and transcripts demonstrates the positive impact this data could have on system performance.
Transcription Methods for Consistency, Volume and Efficiency
Meghan Lammie Glenn | Stephanie M. Strassel | Haejoong Lee | Kazuaki Maeda | Ramez Zakhary | Xuansong Li
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
This paper describes recent efforts at the Linguistic Data Consortium at the University of Pennsylvania to create manual transcripts as a shared resource for human language technology research and evaluation. Speech recognition and related technologies in particular call for substantial volumes of transcribed speech for use in system development, and for human gold standard references for evaluating performance over time. Over the past several years LDC has developed a number of transcription approaches to support the varied goals of speech technology evaluation programs in multiple languages and genres. We describe each transcription method in detail, and report on the results of a comparative analysis of transcriber consistency and efficiency for two transcription methods in three languages and five genres. Our findings suggest that transcripts for planned speech are generally more consistent than those for spontaneous speech, and that careful transcription methods result in higher rates of agreement when compared to quick transcription methods. We conclude with a general discussion of factors contributing to transcription quality, efficiency and consistency.
2008
Bridging the Gap between Linguists and Technology Developers: Large-Scale, Sociolinguistic Annotation for Dialect and Speaker Recognition
Christopher Cieri | Stephanie Strassel | Meghan Glenn | Reva Schwartz | Wade Shen | Joseph Campbell
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
Recent years have seen increased interest within the speaker recognition community in high-level features including, for example, lexical choice, idiomatic expressions or syntactic structures. The promise of speaker recognition in forensic applications drives development toward systems robust to channel differences by selecting features inherently robust to such differences. Within the language recognition community, there is growing interest in differentiating not only languages but also mutually intelligible dialects of a single language. Decades of research in dialectology suggest that high-level features can enable systems to cluster speakers according to the dialects they speak. The Phanotics (Phonetic Annotation of Typicality in Conversational Speech) project seeks to identify high-level features characteristic of American dialects, annotate a corpus for these features, use the data to develop dialect recognition systems, and use the categorization to create better models for speaker recognition. The data, once published, should be useful to other developers of speaker and dialect recognition systems and to dialectologists and sociolinguists. We expect the methods will generalize well beyond the speakers, dialects, and languages discussed here and should, if successful, provide a model for how linguists and technology developers can collaborate in the future for the benefit of both groups and toward a deeper understanding of how languages vary and change.
Quick Rich Transcriptions of Arabic Broadcast News Speech Data
Chomicha Bendahman | Meghan Glenn | Djamel Mostefa | Niklas Paulsson | Stephanie Strassel
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
This paper describes the collection and transcription of a large set of Arabic broadcast news speech data. A total of more than 2000 hours of data was transcribed. The transcription time factor for the broadcast news data was reduced by using the Quick Rich Transcription (QRTR) method and by reducing the number of quality controls performed on the data. The data was collected from several Arabic TV and radio sources and covers both Modern Standard Arabic and dialectal Arabic. The orthographic transcriptions include segmentation, speaker turns, topics, sentence unit types and minimal noise mark-up. The transcripts were produced as part of the GALE project.
Management of Large Annotation Projects Involving Multiple Human Judges: a Case Study of GALE Machine Translation Post-editing
Meghan Lammie Glenn | Stephanie Strassel | Lauren Friedman | Haejoong Lee | Shawn Medero
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
Managing large groups of human judges to perform any annotation task is a challenge. The Linguistic Data Consortium coordinated the creation of manual machine translation post-editing results for the DARPA Global Autonomous Language Exploitation (GALE) program. Machine translation is one of three core technology components for GALE, which includes an annual MT evaluation administered by the National Institute of Standards and Technology. Among the training and test data LDC creates for the GALE program are gold standard translations for system evaluation. The GALE machine translation system evaluation metric is edit distance, measured by HTER (human translation edit rate), which calculates the minimum number of changes required for highly trained human editors to correct MT output so that it has the same meaning as the reference translation. LDC has been responsible for overseeing the post-editing process for GALE. We describe some of the accomplishments and challenges of completing the post-editing effort, including developing a new web-based annotation workflow system and recruiting and training human judges for the task. In addition, we suggest that the workflow system developed for post-editing could be ported efficiently to other annotation efforts.
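For readers unfamiliar with the metric, a minimal sketch of the standard HTER formulation follows; the exact edit categories and normalization are taken from the broader TER literature (Snover et al., 2006), not spelled out in this abstract:

% Hedged sketch of the standard HTER formulation (assumption from the
% TER literature, not quoted from this paper): the minimum number of
% insertions, deletions, substitutions and phrase shifts needed to make
% the MT output match a human-targeted reference, normalized by the
% average number of words in the reference.
\[
\mathrm{HTER} \;=\; \frac{\#\text{Ins} + \#\text{Del} + \#\text{Sub} + \#\text{Shift}}{\text{average \# reference words}}
\]

Lower HTER thus indicates MT output that requires less human post-editing to reach reference quality.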
2007
Linguistic resources in support of various evaluation metrics
Christopher Cieri | Stephanie Strassel | Meghan Lammie Glenn | Lauren Friedman
Proceedings of the Workshop on Automatic procedures in MT evaluation