Developing Partially-Transcribed Speech Corpus from Edited Transcriptions

Kengo Ohta, Masatoshi Tsuchiya, Seiichi Nakagawa


Abstract
Large-scale spontaneous speech corpora are a crucial resource for various domains of spoken language processing. However, the available corpora are usually limited because they are expensive to construct, especially when speech must be transcribed precisely. On the other hand, loosely transcribed corpora such as shorthand notes, meeting records, and closed captions are more widely available than precisely transcribed ones, because their imperfection reduces their construction cost. Because these corpora contain both precisely transcribed regions and edited regions, it is difficult to use them directly as speech corpora for training acoustic models. Against this background, we have been considering an efficient semi-automatic framework for converting loose transcriptions into precise ones. This paper describes an improved method for automatically detecting precise regions in loosely transcribed corpora, for use in that framework. Our detection method consists of two steps: the first step is a forced alignment between loose transcriptions and their utterances, which finds the utterance corresponding to each loose transcription, and the second step is a detector of precise regions based on a support vector machine that uses several features obtained from the first step. Our experimental results show that our method detects precise regions with high accuracy, and that the precise regions extracted by our method are effective as training labels for lightly supervised speaker adaptation.
Anthology ID:
L12-1589
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
3399–3404
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/987_Paper.pdf
Cite (ACL):
Kengo Ohta, Masatoshi Tsuchiya, and Seiichi Nakagawa. 2012. Developing Partially-Transcribed Speech Corpus from Edited Transcriptions. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 3399–3404, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
Developing Partially-Transcribed Speech Corpus from Edited Transcriptions (Ohta et al., LREC 2012)