Linda Brandschain


2010

pdf
Greybeard Longitudinal Speech Study
Linda Brandschain | David Graff | Christopher Cieri | Kevin Walker | Chris Caruso | Abby Neely
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The Greybeard Project was designed so as to enable research in speaker recognition using data that have been collected over a long period of time. Since 1994, LDC has been collecting speech samples for use in research and evaluations. By mining our earlier collections we assembled a list of subjects who had participated in multiple studies. These participants were then contacted and asked to take part in the Greybeard Project. The only constraint was that the participants must have made numerous calls in prior studies and the calls had to be a minimum of two years old. The archived data was sorted by participant and subsequent calls were added to their files. This is the first longitudinal study of its kind. The resulting corpus contains multiple calls for each participant that span anywhere from two to 12 years in time. It is our hope that these data will enable speaker recognition researchers to explore the effects of aging on voice.

pdf
Mixer 6
Linda Brandschain | David Graff | Chris Cieri | Kevin Walker | Chris Caruso | Abby Neely
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Linguistic Data Consortium’s Human Subjects Data Collection lab conducts multi-modal speech collections to develop corpora for use in speech, speaker and language research and evaluations. The Mixer collections have evolved over the years to best accommodate the ever changing needs of the research community and to hopefully keep one step ahead by providing increasingly challenging data. Over the years Mixer collections have grown to include socio-linguistic interviews, a wide variety of telephone conditions and multiple languages, recording conditions, channels and speech acts.. Mixer 6 was the most recent collection. This paper describes the Mixer 6 Phase 1 project. Mixer 6 Phase 1 was a study supporting linguistic research, technology development and education. The object of this study was to record speech in a variety of situations that vary formality and model multiple naturally occurring interactions as well as a variety of channel conditions

2008

pdf
Speaker Recognition: Building the Mixer 4 and 5 Corpora
Linda Brandschain | Christopher Cieri | David Graff | Abby Neely | Kevin Walker
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The original Mixer corpus was designed to satisfy developing commercial and forensic needs. The resulting Mixer corpora, Phases 1 through 5, have evolved to support and increasing variety of research tasks, including multilingual and cross-channel recognition. The Mixer Phases 4 and 5 corpora feature a wider variety of channels and greater variation in the situations under which the speech is recorded. This paper focuses on the plans, progress and results of Mixer 4 and 5.

pdf
New Resources for Document Classification, Analysis and Translation Technologies
Stephanie Strassel | Lauren Friedman | Safa Ismael | Linda Brandschain
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The goal of the DARPA MADCAT (Multilingual Automatic Document Classification Analysis and Translation) Program is to automatically convert foreign language text images into English transcripts, for use by humans and downstream applications. The first phase the program focuses on translation of handwritten Arabic documents. Linguistic Data Consortium (LDC) is creating publicly available linguistic resources for MADCAT technologies, on a scale and richness not previously available. Corpora will consist of existing LDC corpora and data donations from MADCAT partners, plus new data collection to provide high quality material for evaluation and to address strategic gaps (for genre, dialect, image quality, etc.) in the existing resources. Training and test data properties will expand over time to encompass a wide range of topics and genres: letters, diaries, training manuals, brochures, signs, ledgers, memos, instructions, postcards and forms among others. Data will be ground truthed, with line, word and token segmentation and zoning, and translations and word alignments will be produced for a subset. Evaluation data will be carefully selected from the available data pools and high quality references will be produced, which can be used to compare MADCAT system performance against the human-produced gold standard.