Abstract
This paper describes the construction and usage of the MOR and GRASP programs for part-of-speech tagging and syntactic dependency analysis of the corpora in the CHILDES and TalkBank databases. We have written MOR grammars for 11 languages and GRASP analyses for three. For English data, the MOR tagger reaches 98% accuracy on adult corpora and 97% accuracy on child language corpora. The paper discusses the construction of MOR lexicons with an emphasis on compounds and special conversational forms. The shape of rules for controlling allomorphy and morpheme concatenation is discussed. The analysis of bilingual corpora is illustrated in the context of the Cantonese-English bilingual corpora. Methods for preparing data for MOR analysis and for developing MOR grammars are discussed. We believe that recent computational work using this system is leading to significant advances in child language acquisition theory and theories of grammar identification more generally.

- Anthology ID:
- L12-1353
- Volume:
- Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
- Month:
- May
- Year:
- 2012
- Address:
- Istanbul, Turkey
- Editors:
- Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
- Venue:
- LREC
- Publisher:
- European Language Resources Association (ELRA)
- Pages:
- 2375–2380
- URL:
- http://www.lrec-conf.org/proceedings/lrec2012/pdf/616_Paper.pdf
- Cite (ACL):
- Brian MacWhinney. 2012. Morphosyntactic Analysis of the CHILDES and TalkBank Corpora. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 2375–2380, Istanbul, Turkey. European Language Resources Association (ELRA).
- Cite (Informal):
- Morphosyntactic Analysis of the CHILDES and TalkBank Corpora (MacWhinney, LREC 2012)