Child-Directed Language Does Not Consistently Boost Syntax Learning in Language Models
Francesca Padovani, Jaap Jumelet, Yevgen Matusevych, Arianna Bisazza
Abstract
Seminal work by Huebner et al. (2021) showed that language models (LMs) trained on English Child-Directed Language (CDL) can outperform LMs trained on an equal amount of adult-directed text like Wikipedia. However, it remains unclear whether these results generalize across languages, architectures, and evaluation settings. We test this by comparing models trained on CDL vs. Wikipedia across two LM objectives (masked and causal), three languages (English, French, German), and three syntactic minimal pair benchmarks. Our results on these benchmarks show inconsistent benefits of CDL, which in most cases is outperformed by Wikipedia models. We then identify various shortcomings in these benchmarks, and introduce a novel testing methodology, FIT-CLAMS, which uses a frequency-controlled design to enable balanced comparisons across training corpora. Through minimal pair evaluations and regression analysis we show that training on CDL does not yield stronger generalizations for acquiring syntax and highlight the importance of controlling for frequency effects when evaluating syntactic ability.
- Anthology ID:
- 2025.emnlp-main.999
- Volume:
- Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
- Month:
- November
- Year:
- 2025
- Address:
- Suzhou, China
- Editors:
- Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue:
- EMNLP
- Association for Computational Linguistics
- Pages:
- 19746–19767
- URL:
- https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.999/
- Cite (ACL):
- Francesca Padovani, Jaap Jumelet, Yevgen Matusevych, and Arianna Bisazza. 2025. Child-Directed Language Does Not Consistently Boost Syntax Learning in Language Models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 19746–19767, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal):
- Child-Directed Language Does Not Consistently Boost Syntax Learning in Language Models (Padovani et al., EMNLP 2025)
- PDF:
- https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.999.pdf
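The minimal pair evaluation mentioned in the abstract scores a model by checking whether it assigns higher probability to the grammatical member of each sentence pair. A minimal sketch of that accuracy computation, assuming hypothetical per-token log-probabilities rather than scores from an actual masked or causal LM:

```python
def sentence_logprob(token_logprobs):
    """Total log-probability of a sentence from per-token log-probs."""
    return sum(token_logprobs)

def minimal_pair_accuracy(pairs):
    """Fraction of pairs where the grammatical sentence scores higher.

    Each pair is (grammatical_logprobs, ungrammatical_logprobs):
    lists of per-token log-probabilities. The values below are
    made up for illustration; in practice they would come from
    a trained language model.
    """
    correct = sum(
        sentence_logprob(good) > sentence_logprob(bad)
        for good, bad in pairs
    )
    return correct / len(pairs)

# Toy example with invented log-probabilities:
pairs = [
    ([-1.2, -0.8, -2.0], [-1.2, -3.5, -2.0]),  # model prefers grammatical
    ([-2.5, -1.0], [-2.0, -1.1]),              # model prefers ungrammatical
]
print(minimal_pair_accuracy(pairs))  # → 0.5
```

The benchmark-level score reported for a model is simply this accuracy over all pairs in a syntactic paradigm; chance performance is 0.5.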