A Dataset for Oral Reading in Young English Readers

Madison Rose, Michael Bennie, Valeria Pagliai, Hatice Kubra Karakis, Qian Shen, Xinyi Tai, Walter L. Leite, Zoey Liu


Abstract
Among English child speech corpora, very few focus on oral reading. Existing resources such as the CMU Kids Corpus (Ellis Weismer et al., 2013) are limited by a lack of grade-appropriate, curriculum-aligned reading texts, by the scope and quality of their annotation, and, most crucially, by the absence of a comprehensive annotation scheme for characterizing children’s reading errors. This study presents a multi-layered, fully manually annotated corpus of oral reading from 63 first- through third-grade students residing in the U.S. who grow up hearing and speaking English. Additionally, we contribute methodologically rigorous annotation guidelines that define 10 reading error categories and 26 sublevel error labels. Using a digital reading platform supported by GPT-4o-mini (OpenAI, 2024), children read stories on topics of their own interest while the system records their speech and logs their interactions with embedded digital supports. Each recording is paired with detailed demographic and educational metadata and annotated on three linguistic layers: (1) sentence- and word-level time alignment; (2) phonemic transcription; and (3) reading errors.
Anthology ID:
2026.conll-main.35
Volume:
Proceedings of the 30th Conference on Computational Natural Language Learning
Month:
July
Year:
2026
Address:
San Diego, California, USA
Editors:
Claire Bonial, Yevgeni Berzak
Venues:
CoNLL | WS
Publisher:
Association for Computational Linguistics
Pages:
588–600
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.conll-main.35/
Cite (ACL):
Madison Rose, Michael Bennie, Valeria Pagliai, Hatice Kubra Karakis, Qian Shen, Xinyi Tai, Walter L. Leite, and Zoey Liu. 2026. A Dataset for Oral Reading in Young English Readers. In Proceedings of the 30th Conference on Computational Natural Language Learning, pages 588–600, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal):
A Dataset for Oral Reading in Young English Readers (Rose et al., CoNLL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.conll-main.35.pdf