What did you say? Generating Child-Directed Speech Questions to Train LLMs

Whitney Poh; Michael Tombolini; Libby Barak

Abstract

Child-Directed Speech (CDS) has unique linguistic properties that distinguish it from other types of textual corpora. Language models trained on CDS often obtain superior results compared with models trained on the same amount of other types of data. Several studies have modified non-CDS data to mimic the linguistic properties of CDS that are hypothesized to be advantageous. Here, we propose adapting the non-CDS portions of the training data to include questions similar to those in CDS interaction. We add artificially generated questions to the data and methodically analyze the change in performance for each modified dataset. Our results show that artificial question generation strongly depends on the properties of the original dataset. While performance improves on question-related measures, overall performance is negatively affected as a result of reduced syntactic diversity.

Anthology ID: 2025.babylm-main.18
Volume: Proceedings of the First BabyLM Workshop
Month: November
Year: 2025
Address: Suzhou, China
Editors: Lucas Charpentier, Leshem Choshen, Ryan Cotterell, Mustafa Omer Gul, Michael Y. Hu, Jing Liu, Jaap Jumelet, Tal Linzen, Aaron Mueller, Candace Ross, Raj Sanjay Shah, Alex Warstadt, Ethan Gotlieb Wilcox, Adina Williams
Venue: BabyLM
Publisher: Association for Computational Linguistics
Pages: 237–245
URL: https://preview.aclanthology.org/ingest-emnlp/2025.babylm-main.18/
Cite (ACL): Whitney Poh, Michael Tombolini, and Libby Barak. 2025. What did you say? Generating Child-Directed Speech Questions to Train LLMs. In Proceedings of the First BabyLM Workshop, pages 237–245, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): What did you say? Generating Child-Directed Speech Questions to Train LLMs (Poh et al., BabyLM 2025)
PDF: https://preview.aclanthology.org/ingest-emnlp/2025.babylm-main.18.pdf
