Ann Sawyer
2014
Collecting Natural SMS and Chat Conversations in Multiple Languages: The BOLT Phase 2 Corpus
Zhiyi Song | Stephanie Strassel | Haejoong Lee | Kevin Walker | Jonathan Wright | Jennifer Garland | Dana Fore | Brian Gainor | Preston Cabe | Thomas Thomas | Brendan Callahan | Ann Sawyer
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
The DARPA BOLT Program develops systems capable of allowing English speakers to retrieve and understand information from informal foreign language genres. Phase 2 of the program required large volumes of naturally occurring informal text (SMS) and chat messages from individual users in multiple languages to support evaluation of machine translation systems. We describe the design and implementation of a robust collection system capable of capturing both live and archived SMS and chat conversations from willing participants. We also discuss the challenges of recruitment at a time when potential participants have acute and growing concerns about their personal privacy in the realm of digital communication, and we outline the techniques adopted to confront those challenges. Finally, we review the properties of the resulting BOLT Phase 2 Corpus, which comprises over 6.5 million words of naturally occurring chat and SMS in English, Chinese and Egyptian Arabic.
The RATS Collection: Supporting HLT Research with Degraded Audio Data
David Graff | Kevin Walker | Stephanie Strassel | Xiaoyi Ma | Karen Jones | Ann Sawyer
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
The DARPA RATS program was established to foster development of language technology systems that can perform well on speaker-to-speaker communications over radio channels that evince a wide range in the type and extent of signal variability and acoustic degradation. Creating suitable corpora to address this need poses an equally wide range of challenges for the collection, annotation and quality assessment of relevant data. This paper describes the LDC's multi-year effort to build the RATS data collection, summarizes the content and properties of the resulting corpora, and discusses the novel problems and approaches involved in ensuring that the data would satisfy its intended use: to provide speech recordings and annotations for training and evaluating HLT systems that perform four specific tasks on difficult radio channels: Speech Activity Detection (SAD), Language Identification (LID), Speaker Identification (SID) and Keyword Spotting (KWS).