A Dataset for Oral Reading in Young English Readers
Madison Rose, Michael Bennie, Valeria Pagliai, Hatice Kubra Karakis, Qian Shen, Xinyi Tai, Walter L. Leite, Zoey Liu
Abstract
Among English child speech corpora, very few focus on oral reading. Existing resources such as the CMU Kids Corpus (Ellis Weismer et al., 2013) are limited by a lack of grade-appropriate, curriculum-aligned reading texts, by the scope and quality of their annotation, and, most crucially, by the absence of a comprehensive annotation scheme for characterizing children’s reading errors. This study presents a multi-layered, fully manually annotated corpus of oral reading from 63 first- to third-grade students residing in the U.S. who grow up hearing and speaking English. Additionally, we contribute methodologically rigorous annotation guidelines that define 10 reading error categories and 26 sublevel error labels. Using a digital reading platform supported by GPT-4o-mini (OpenAI, 2024), children read stories on topics of their own interest while the system records their speech and logs their interactions with embedded digital supports. Each recording is paired with detailed demographic and educational metadata and annotated linguistically for (1) sentence- and word-level time alignment, (2) phonemic transcription, and (3) reading errors.
- Anthology ID:
- 2026.conll-main.35
- Volume:
- Proceedings of the 30th Conference on Computational Natural Language Learning
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, USA
- Editors:
- Claire Bonial, Yevgeni Berzak
- Venues:
- CoNLL | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 588–600
- URL:
- https://preview.aclanthology.org/ingest-acl-workshops/2026.conll-main.35/
- Cite (ACL):
- Madison Rose, Michael Bennie, Valeria Pagliai, Hatice Kubra Karakis, Qian Shen, Xinyi Tai, Walter L. Leite, and Zoey Liu. 2026. A Dataset for Oral Reading in Young English Readers. In Proceedings of the 30th Conference on Computational Natural Language Learning, pages 588–600, San Diego, California, USA. Association for Computational Linguistics.
- Cite (Informal):
- A Dataset for Oral Reading in Young English Readers (Rose et al., CoNLL 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl-workshops/2026.conll-main.35.pdf