Francesca Padovani
2026
BabyBabelLM: A Multilingual Benchmark of Developmentally Plausible Training Data
Jaap Jumelet | Abdellah Fourtassi | Akari Haga | Bastian Bunzeck | Bhargav Shandilya | Diana Galván-Sosa | Faiz Ghifari Haznitrama | Francesca Padovani | Francois Meyer | Hai Hu | Julen Etxaniz | Laurent Prévot | Linyang He | María Grandury | Mila Marcheva | Negar Foroutan | Nikitas Theodoropoulos | Pouya Sadeghi | Siyuan Song | Suchir Salhan | Susana Zhou | Yurii Paniv | Ziyin Zhang | Arianna Bisazza | Alex Warstadt | Leshem Choshen
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
We present BabyBabelLM, a multilingual collection of datasets modeling the language a person observes from birth until they acquire a native language. We curate developmentally plausible pretraining data aiming to cover the equivalent of 100M English words of content in each of 45 languages. We compile evaluation suites and train baseline models in each language. BabyBabelLM aims to facilitate multilingual pretraining and cognitive modeling.
CAIT: A Syntactic Parsing Toolkit for Child–Adult InTeractions
Francesca Padovani | Xiulin Yang | Bastian Bunzeck | Jaap Jumelet | Yevgen Matusevych | Nathan Schneider | Arianna Bisazza
Proceedings of the 30th Conference on Computational Natural Language Learning
CHILDES is a paramount resource for language acquisition studies, yet computational tools for analyzing its syntactic structure remain limited. Leveraging the recent release of the UD-English-CHILDES treebank with gold-standard Universal Dependencies (UD) annotations, we train a state-of-the-art dependency parser specifically tailored to CHILDES. The parser more accurately captures syntactic patterns in child–adult interactions, outperforming widely used off-the-shelf English parsers, including spaCy and Stanza. Alongside the parser, we also release a Part-of-Speech tagger and an utterance-level construction tagger, which together form the open-source Syntactic Annotation Toolkit for Child–Adult InTeractions (CAIT). Through a detailed error analysis and a case study tracking the distribution of syntactic constructions across developmental time in CHILDES, we demonstrate the practical utility of the toolkit for large-scale, reproducible research on language acquisition.
2025
Child-Directed Language Does Not Consistently Boost Syntax Learning in Language Models
Francesca Padovani | Jaap Jumelet | Yevgen Matusevych | Arianna Bisazza
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Seminal work by Huebner et al. (2021) showed that language models (LMs) trained on English Child-Directed Language (CDL) can outperform LMs trained on an equal amount of adult-directed text like Wikipedia. However, it remains unclear whether these results generalize across languages, architectures, and evaluation settings. We test this by comparing models trained on CDL vs. Wikipedia across two LM objectives (masked and causal), three languages (English, French, German), and three syntactic minimal pair benchmarks. Our results on these benchmarks show inconsistent benefits of CDL, which in most cases is outperformed by Wikipedia models. We then identify various shortcomings in these benchmarks, and introduce a novel testing methodology, FIT-CLAMS, which uses a frequency-controlled design to enable balanced comparisons across training corpora. Through minimal pair evaluations and regression analysis we show that training on CDL does not yield stronger generalizations for acquiring syntax and highlight the importance of controlling for frequency effects when evaluating syntactic ability.
TurBLiMP: A Turkish Benchmark of Linguistic Minimal Pairs
Ezgi Başar | Francesca Padovani | Jaap Jumelet | Arianna Bisazza
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
We introduce TurBLiMP, the first Turkish benchmark of linguistic minimal pairs, designed to evaluate the linguistic abilities of monolingual and multilingual language models (LMs). Covering 16 linguistic phenomena with 1000 minimal pairs each, TurBLiMP fills an important gap in linguistic evaluation resources for Turkish. In designing the benchmark, we give extra attention to two properties of Turkish that remain understudied in current syntactic evaluations of LMs, namely word order flexibility and subordination through morphological processes. Our experiments on a wide range of LMs and a newly collected set of human acceptability judgments reveal that even cutting-edge Large LMs still struggle with grammatical phenomena that are not challenging for humans, and may also exhibit different sensitivities to word order and morphological complexity compared to humans.
Dialogue Is Not Enough to Make a Communicative BabyLM (But Neither Is Developmentally Inspired Reinforcement Learning)
Francesca Padovani | Bastian Bunzeck | Manar Ali | Omar Momen | Arianna Bisazza | Hendrik Buschmeier | Sina Zarrieß
Proceedings of the First BabyLM Workshop
We investigate whether pre-training exclusively on dialogue data results in formally and functionally apt small language models. Based on this pre-trained llamalogue model, we employ a variety of fine-tuning strategies to enforce “more communicative” text generations by our models. Although our models underperform on most standard BabyLM benchmarks, they excel at dialogue continuation prediction in a minimal pair setting. While PPO fine-tuning has mixed to adversarial effects on our models, DPO fine-tuning further improves their performance on our custom dialogue benchmark.
2024
Co-authors
- Arianna Bisazza 5
- Jaap Jumelet 4
- Bastian Bunzeck 3
- Yevgen Matusevych 2
- Manar Ali 1
- Ezgi Başar 1
- Hendrik Buschmeier 1
- Leshem Choshen 1
- Julen Etxaniz 1
- Negar Foroutan 1
- Abdellah Fourtassi 1
- Martina Galletti 1
- Diana Galván-Sosa 1
- María Grandury 1
- Akari Haga 1
- Faiz Ghifari Haznitrama 1
- Linyang He 1
- Hai Hu 1
- Caterina Marchesi 1
- Mila Marcheva 1
- Francois Meyer 1
- Omar Momen 1
- Daniele Nardi 1
- Yurii Paniv 1
- Eleonora Pasqua 1
- Laurent Prévot 1
- Pouya Sadeghi 1
- Suchir Salhan 1
- Nathan Schneider 1
- Bhargav Shandilya 1
- Siyuan Song 1
- Nikitas Theodoropoulos 1
- Alex Warstadt 1
- Xiulin Yang 1
- Sina Zarrieß 1
- Ziyin Zhang 1
- Susana Zhou 1