Kevin Walker

2020

We introduce a new resource, the SAFE-T (Speech Analysis for Emergency Response Technology) Corpus, designed to simulate first-responder communications by inducing high vocal effort and urgent speech with situational background noise in a game-based collection protocol. Linguistic Data Consortium developed the SAFE-T Corpus to support the NIST (National Institute of Standards and Technology) OpenSAT (Speech Analytic Technologies) evaluation series, whose goal is to advance speech analytic technologies including automatic speech recognition, speech activity detection and keyword search in multiple domains including simulated public safety communications data. The corpus comprises over 300 hours of audio from 115 unique speakers engaged in a collaborative problem-solving activity representative of public safety communications in terms of speech content, noise types and noise levels. Portions of the corpus have been used in the OpenSAT 2019 evaluation and the full corpus will be published in the LDC catalog. We describe the design and implementation of the SAFE-T Corpus collection, discuss the approach of capturing spontaneous speech from study participants through game-based speech collection, and report on the collection results including several challenges associated with the collection.

Call My Net 2: A New Resource for Speaker Recognition
Karen Jones | Stephanie Strassel | Kevin Walker | Jonathan Wright
Proceedings of the 12th Language Resources and Evaluation Conference

We introduce the Call My Net 2 (CMN2) Corpus, a new resource for speaker recognition featuring Tunisian Arabic conversations between friends and family, incorporating both traditional telephony and VoIP data. The corpus contains data from over 400 Tunisian Arabic speakers collected via a custom-built platform deployed in Tunis, with each speaker making 10 or more calls each lasting up to 10 minutes. Calls include speech in various realistic and natural acoustic settings, both noisy and non-noisy. Speakers used a variety of handsets, including landline and mobile devices, and made VoIP calls from tablets or computers. All calls were subject to a series of manual and automatic quality checks, including speech duration, audio quality, language identity and speaker identity. The CMN2 corpus has been used in two NIST Speaker Recognition Evaluations (SRE18 and SRE19), and the SRE test sets as well as the full CMN2 corpus will be published in the Linguistic Data Consortium Catalog. We describe CMN2 corpus requirements, the telephone collection platform, and procedures for call collection. We review properties of the CMN2 dataset and discuss features of the corpus that distinguish it from prior SRE collection efforts, including some of the technical challenges encountered with collecting VoIP data.

2016

Multi-language Speech Collection for NIST LRE
Karen Jones | Stephanie Strassel | Kevin Walker | David Graff | Jonathan Wright
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The Multi-language Speech (MLS) Corpus supports NIST’s Language Recognition Evaluation series by providing new conversational telephone speech and broadcast narrowband data in 20 languages/dialects. The corpus was built to test system performance in distinguishing closely related or confusable linguistic varieties, and careful manual auditing of collected data was an important aspect of this work. This paper lists the specific data requirements for the collection and provides both a commentary on the rationale for those requirements and an outline of the various steps taken to ensure all goals were met as specified. LDC conducted a large-scale recruitment effort involving the implementation of candidate assessment and interview techniques suitable for hiring a large contingent of telecommuting workers, and this recruitment effort is discussed in detail. We also describe the telephone and broadcast collection infrastructure and protocols, and provide details of the steps taken to pre-process collected data prior to auditing. Finally, annotation training, procedures and outcomes are presented in detail.

2014

The DARPA BOLT Program develops systems capable of allowing English speakers to retrieve and understand information from informal foreign language genres. Phase 2 of the program required large volumes of naturally occurring informal text (SMS) and chat messages from individual users in multiple languages to support evaluation of machine translation systems. We describe the design and implementation of a robust collection system capable of capturing both live and archived SMS and chat conversations from willing participants. We also discuss the challenges of recruitment at a time when potential participants have acute and growing concerns about their personal privacy in the realm of digital communication, and we outline the techniques adopted to confront those challenges. Finally, we review the properties of the resulting BOLT Phase 2 Corpus, which comprises over 6.5 million words of naturally occurring chat and SMS in English, Chinese and Egyptian Arabic.

The RATS Collection: Supporting HLT Research with Degraded Audio Data
David Graff | Kevin Walker | Stephanie Strassel | Xiaoyi Ma | Karen Jones | Ann Sawyer
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The DARPA RATS program was established to foster development of language technology systems that can perform well on speaker-to-speaker communications over radio channels that evince a wide range in the type and extent of signal variability and acoustic degradation. Creating suitable corpora to address this need poses an equally wide range of challenges for the collection, annotation and quality assessment of relevant data. This paper describes LDC's multi-year effort to build the RATS data collection, summarizes the content and properties of the resulting corpora, and discusses the novel problems and approaches involved in ensuring that the data would satisfy its intended use: to provide speech recordings and annotations for training and evaluating HLT systems that perform four specific tasks on difficult radio channels: Speech Activity Detection (SAD), Language Identification (LID), Speaker Identification (SID) and Keyword Spotting (KWS).

2010

Large Scale Multilingual Broadcast Data Collection to Support Machine Translation and Distillation Technology Development
Kevin Walker | Christopher Caruso | Denise DiPersio
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The development of technologies to address machine translation and distillation of multilingual broadcast data depends heavily on the collection of large volumes of material from modern data providers. To address the needs of GALE researchers, the Linguistic Data Consortium (LDC) developed a system for collecting broadcast news and conversation from a variety of Arabic, Chinese and English broadcasters. The system is highly automated, easily extensible and robust, and is capable of collecting, processing and evaluating hundreds of hours of content from several dozen sources per day. In addition to this extensive system, LDC manages three remote collection sites to maximize the variety of available broadcast data and has designed a portable broadcast collection platform to facilitate remote collection. This paper will present a detailed description of the design and implementation of LDC's collection system, the technical challenges and solutions to large scale broadcast data collection efforts and an overview of the system's operation. This paper will also discuss the challenges of managing remote collections, in particular, the strategies used to normalize data formats, naming conventions and delivery methods to achieve optimal integration of remotely-collected data into LDC's collection database and downstream tasking workflow.

The Greybeard Project was designed to enable research in speaker recognition using data that have been collected over a long period of time. Since 1994, LDC has been collecting speech samples for use in research and evaluations. By mining our earlier collections we assembled a list of subjects who had participated in multiple studies. These participants were then contacted and asked to take part in the Greybeard Project. The only constraint was that the participants must have made numerous calls in prior studies and the calls had to be a minimum of two years old. The archived data were sorted by participant and subsequent calls were added to their files. This is the first longitudinal study of its kind. The resulting corpus contains multiple calls for each participant that span anywhere from two to 12 years in time. It is our hope that these data will enable speaker recognition researchers to explore the effects of aging on voice.

Linguistic Data Consortium's Human Subjects Data Collection lab conducts multi-modal speech collections to develop corpora for use in speech, speaker and language research and evaluations. The Mixer collections have evolved over the years to best accommodate the ever-changing needs of the research community and, ideally, to keep one step ahead by providing increasingly challenging data. Over the years Mixer collections have grown to include socio-linguistic interviews, a wide variety of telephone conditions and multiple languages, recording conditions, channels and speech acts. Mixer 6 was the most recent collection. This paper describes the Mixer 6 Phase 1 project. Mixer 6 Phase 1 was a study supporting linguistic research, technology development and education. The object of this study was to record speech in a variety of situations that vary in formality and model multiple naturally occurring interactions as well as a variety of channel conditions.

2008

Speaker Recognition: Building the Mixer 4 and 5 Corpora
Linda Brandschain | Christopher Cieri | David Graff | Abby Neely | Kevin Walker
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The original Mixer corpus was designed to satisfy developing commercial and forensic needs. The resulting Mixer corpora, Phases 1 through 5, have evolved to support an increasing variety of research tasks, including multilingual and cross-channel recognition. The Mixer Phases 4 and 5 corpora feature a wider variety of channels and greater variation in the situations under which the speech is recorded. This paper focuses on the plans, progress and results of Mixer 4 and 5.

2006

This paper describes the planning and creation of the Mixer and Transcript Reading corpora, their properties and yields, and reports on the lessons learned during their development.

Low-cost Customized Speech Corpus Creation for Speech Technology Applications
Kazuaki Maeda | Christopher Cieri | Kevin Walker
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Speech technology applications, such as speech recognition, speech synthesis, and speech dialog systems, often require corpora based on highly customized specifications. Existing corpora available to the community, such as TIMIT and other corpora distributed by LDC and ELDA, do not always meet the requirements of such applications. In such cases, the developers need to create their own corpora. The creation of a highly customized speech corpus, however, could be a very expensive and time-consuming task, especially for small organizations. It requires multidisciplinary expertise in linguistics, management and engineering as it involves subtasks such as the corpus design, human subject recruitment, recording, quality assurance, and in some cases, segmentation, transcription and annotation. This paper describes LDC's recent involvement in the creation of a low-cost yet highly-customized speech corpus for a commercial organization under a novel data creation and licensing model, which benefits both the particular data requester and the general linguistic data user community.

2004

The Fisher Corpus: a Resource for the Next Generations of Speech-to-Text
Christopher Cieri | David Miller | Kevin Walker
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

The Mixer Corpus of Multilingual, Multichannel Speaker Recognition Data
Christopher Cieri | Joseph P. Campbell | Hirotaka Nakasone | David Miller | Kevin Walker
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
