David Graff


The SAFE-T Corpus: A New Resource for Simulated Public Safety Communications
Dana Delgado | Kevin Walker | Stephanie Strassel | Karen Jones | Christopher Caruso | David Graff
Proceedings of the Twelfth Language Resources and Evaluation Conference

We introduce a new resource, the SAFE-T (Speech Analysis for Emergency Response Technology) Corpus, designed to simulate first-responder communications by inducing high vocal effort and urgent speech with situational background noise in a game-based collection protocol. Linguistic Data Consortium developed the SAFE-T Corpus to support the NIST (National Institute of Standards and Technology) OpenSAT (Speech Analytic Technologies) evaluation series, whose goal is to advance speech analytic technologies including automatic speech recognition, speech activity detection and keyword search in multiple domains including simulated public safety communications data. The corpus comprises over 300 hours of audio from 115 unique speakers engaged in a collaborative problem-solving activity representative of public safety communications in terms of speech content, noise types and noise levels. Portions of the corpus have been used in the OpenSAT 2019 evaluation and the full corpus will be published in the LDC catalog. We describe the design and implementation of the SAFE-T Corpus collection, discuss the approach of capturing spontaneous speech from study participants through game-based speech collection, and report on the collection results including several challenges associated with the collection.


Multi-language Speech Collection for NIST LRE
Karen Jones | Stephanie Strassel | Kevin Walker | David Graff | Jonathan Wright
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The Multi-language Speech (MLS) Corpus supports NIST’s Language Recognition Evaluation series by providing new conversational telephone speech and broadcast narrowband data in 20 languages/dialects. The corpus was built with the intention of testing system performance in the matter of distinguishing closely related or confusable linguistic varieties, and careful manual auditing of collected data was an important aspect of this work. This paper lists the specific data requirements for the collection and provides both a commentary on the rationale for those requirements as well as an outline of the various steps taken to ensure all goals were met as specified. LDC conducted a large-scale recruitment effort involving the implementation of candidate assessment and interview techniques suitable for hiring a large contingent of telecommuting workers, and this recruitment effort is discussed in detail. We also describe the telephone and broadcast collection infrastructure and protocols, and provide details of the steps taken to pre-process collected data prior to auditing. Finally, annotation training, procedures and outcomes are presented in detail.


The RATS Collection: Supporting HLT Research with Degraded Audio Data
David Graff | Kevin Walker | Stephanie Strassel | Xiaoyi Ma | Karen Jones | Ann Sawyer
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The DARPA RATS program was established to foster development of language technology systems that can perform well on speaker-to-speaker communications over radio channels that evince a wide range in the type and extent of signal variability and acoustic degradation. Creating suitable corpora to address this need poses an equally wide range of challenges for the collection, annotation and quality assessment of relevant data. This paper describes the LDC’s multi-year effort to build the RATS data collection, summarizes the content and properties of the resulting corpora, and discusses the novel problems and approaches involved in ensuring that the data would satisfy its intended use, to provide speech recordings and annotations for training and evaluating HLT systems that perform 4 specific tasks on difficult radio channels: Speech Activity Detection (SAD), Language Identification (LID), Speaker Identification (SID) and Keyword Spotting (KWS).


Developing LMF-XML Bilingual Dictionaries for Colloquial Arabic Dialects
David Graff | Mohamed Maamouri
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The Linguistic Data Consortium and Georgetown University Press are collaborating to create updated editions of bilingual diction- aries that had originally been published in the 1960's for English-speaking learners of Moroccan, Syrian and Iraqi Arabic. In their first editions, these dictionaries used ad hoc Latin-alphabet orthography for each colloquial Arabic dialect, but adopted some proper- ties of Arabic-based writing (collation order of Arabic headwords, clitic attachment to word forms in example phrases); despite their common features, there are notable differences among the three books that impede comparisons across the dialects, as well as com- parisons of each dialect to Modern Standard Arabic. In updating these volumes, we use both Arabic script and International Pho- netic Alphabet orthographies; the former provides a common basis for word recognition across dialects, while the latter provides dialect-specific pronunciations. Our goal is to preserve the full content of the original publications, supplement the Arabic headword inventory with new usages, and produce a uniform lexicon structure expressible via the Lexical Markup Framework (LMF, ISO 24613). To this end, we developed a relational database schema that applies consistently to each dialect, and HTTP-based tools for searching, editing, workflow, review and inventory management.


Greybeard Longitudinal Speech Study
Linda Brandschain | David Graff | Christopher Cieri | Kevin Walker | Chris Caruso | Abby Neely
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The Greybeard Project was designed so as to enable research in speaker recognition using data that have been collected over a long period of time. Since 1994, LDC has been collecting speech samples for use in research and evaluations. By mining our earlier collections we assembled a list of subjects who had participated in multiple studies. These participants were then contacted and asked to take part in the Greybeard Project. The only constraint was that the participants must have made numerous calls in prior studies and the calls had to be a minimum of two years old. The archived data was sorted by participant and subsequent calls were added to their files. This is the first longitudinal study of its kind. The resulting corpus contains multiple calls for each participant that span anywhere from two to 12 years in time. It is our hope that these data will enable speaker recognition researchers to explore the effects of aging on voice.

Mixer 6
Linda Brandschain | David Graff | Chris Cieri | Kevin Walker | Chris Caruso | Abby Neely
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Linguistic Data Consortium’s Human Subjects Data Collection lab conducts multi-modal speech collections to develop corpora for use in speech, speaker and language research and evaluations. The Mixer collections have evolved over the years to best accommodate the ever changing needs of the research community and to hopefully keep one step ahead by providing increasingly challenging data. Over the years Mixer collections have grown to include socio-linguistic interviews, a wide variety of telephone conditions and multiple languages, recording conditions, channels and speech acts.. Mixer 6 was the most recent collection. This paper describes the Mixer 6 Phase 1 project. Mixer 6 Phase 1 was a study supporting linguistic research, technology development and education. The object of this study was to record speech in a variety of situations that vary formality and model multiple naturally occurring interactions as well as a variety of channel conditions


Speaker Recognition: Building the Mixer 4 and 5 Corpora
Linda Brandschain | Christopher Cieri | David Graff | Abby Neely | Kevin Walker
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The original Mixer corpus was designed to satisfy developing commercial and forensic needs. The resulting Mixer corpora, Phases 1 through 5, have evolved to support and increasing variety of research tasks, including multilingual and cross-channel recognition. The Mixer Phases 4 and 5 corpora feature a wider variety of channels and greater variation in the situations under which the speech is recorded. This paper focuses on the plans, progress and results of Mixer 4 and 5.


Lexicon Development for Varieties of Spoken Colloquial Arabic
David Graff | Tim Buckwalter | Mohamed Maamouri | Hubert Jin
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In Arabic speech communities, there is a diglossic gap between written/formal Modern Standard Arabic (MSA) and spoken/casual colloquial dialectal Arabic (DA): the common spoken language has no standard representation in written form, while the language observed in texts has limited occurrence in speech. Hence the task of developing language resources to describe and model DA speech involves extra work to establish conventions for orthography and grammatical analysis. We describe work being done at the LDC to develop lexicons for DA, comprising pronunciation, morphology and part-of-speech labeling for word forms in recorded speech. Components of the approach are: (a) a two-layer transcription, providing a consonant-skeleton form and a pronunciation form; (b) manual annotation of morphology, part-of-speech and English gloss, followed by development of automatic word parsers modeled on the Buckwalter Morphological Analyzer for MSA; (c) customized user interfaces and supporting tools for all stages of annotation; and (d) a relational database for storing, emending and publishing the transcription corpus as well as the lexicon.


Large, Multilingual, Broadcast News Corpora for Cooperative Research in Topic Detection and Tracking: The TDT-2 and TDT-3 Corpus Efforts
Christopher Cieri | David Graff | Mark Liberman | Nii Martey | Stephanie Strassel
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

Quality Control in Large Annotation Projects Involving Multiple Judges: The Case of the TDT Corpora
Stephanie Strassel | David Graff | Nii Martey | Christopher Cieri
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

Many Uses, Many Annotations for Large Speech Corpora: Switchboard and TDT as Case Studies
David Graff | Steven Bird
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)


Multilingual Text Resources at the Linguistic Data Consortium
David Graff | Rebecca Finch
Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994