Abstract
Recent advances in modeling early language acquisition are due not only to the development of machine-learning techniques, but also to the increasing availability of data on child language and child-adult interaction. In the absence of recordings of child-directed speech, or when models explicitly require such a representation for training data, phonemic transcriptions are commonly used as input data. We present a novel (and to our knowledge, the first) phonemic corpus of Polish child-directed speech. It is derived from the Weist corpus of Polish, freely available from the seminal CHILDES database. For the sake of reproducibility, and to exemplify the typical trade-off between ecological validity and sample size, we report all preprocessing operations and transcription guidelines. Contributed linguistic resources include updated CHAT-formatted transcripts with phonemic transcriptions in a novel phonology tier, as well as by-product data, such as a phonemic lexicon of Polish. All resources are distributed under the LGPL-LR license.
- Anthology ID:
- L12-1660
- Volume:
- Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
- Month:
- May
- Year:
- 2012
- Address:
- Istanbul, Turkey
- Editors:
- Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
- Venue:
- LREC
- Publisher:
- European Language Resources Association (ELRA)
- Pages:
- 1017–1020
- URL:
- http://www.lrec-conf.org/proceedings/lrec2012/pdf/1120_Paper.pdf
- Cite (ACL):
- Luc Boruta and Justyna Jastrzebska. 2012. A Phonemic Corpus of Polish Child-Directed Speech. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 1017–1020, Istanbul, Turkey. European Language Resources Association (ELRA).
- Cite (Informal):
- A Phonemic Corpus of Polish Child-Directed Speech (Boruta & Jastrzebska, LREC 2012)