Michael Tombolini
2026
Making Synthetic Questions More Child-Directed: Prompting and Sampling Effects
Whitney Poh | Michael Tombolini | Libby Barak
Proceedings of the 1st Workshop on Computational Developmental Linguistics (CDL)
Child-Directed Speech (CDS) has been shown to better support language learning when used as training data for computational models. Artificially generated input aims to replicate the advantage of CDS by re-creating targeted linguistic properties. Recently, the use of questions in CDS has been suggested as a linguistic property that may provide an effective discourse structure for model training. However, previous work has shown inconsistent improvement over baseline when questions are used in training data. In this study, we propose a new question generation method that aligns both the generation prompts and the sampling methods with properties of CDS. We show that prompt wording substantially changes whether synthetic questions match CDS on surface properties such as mean length of utterance (MLU) and question type. Despite marked improvements over baseline, enhanced CDS-likeness does not translate into consistent downstream gains. Overall, our results show that the role of questions in training data merits further investigation.
2025
What did you say? Generating Child-Directed Speech Questions to Train LLMs
Whitney Poh | Michael Tombolini | Libby Barak
Proceedings of the First BabyLM Workshop
Child-Directed Speech (CDS) has unique linguistic properties that distinguish it from other types of textual corpora. Language models trained on CDS often obtain superior results compared with models trained on the same amount of other types of data. Several studies have aimed at modifying non-CDS data to mimic the linguistic properties of CDS and so match its hypothesized advantageous aspects. Here, we propose to adapt the non-CDS portions of the training data to include questions similar to those in CDS interaction. We modify the data by adding artificially generated questions and methodically analyze the change in performance for each modified dataset. Our results show that artificial question generation strongly depends on the properties of the original dataset. While performance improves on question-related measures, overall performance is negatively affected as a result of reduced syntactic diversity.