2025
What is the Best Sequence Length for BabyLM?
Suchir Salhan | Richard Diehl Martinez | Zebulon Goriely | Paula Buttery
Proceedings of the First BabyLM Workshop
Transformer language models typically operate with a fixed-length context window, which has grown in step with large-scale pretraining datasets. In the BabyLM Challenge, however, many past submissions have defaulted to much shorter sequence lengths. We examine the impact of sequence length on BabyLM pretraining to answer a simple question: what sequence length should we use when training BabyLMs? Using 100M-word training data and fixed compute budgets, we compare 125M-parameter Mamba and OPT models, finding that although longer is often better, the optimal length depends on both task and architecture. Shorter sequences are sufficient for grammatical generalization tasks, whereas longer contexts benefit morphological analogical reasoning tasks.
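To make the comparison concrete, the sketch below shows one way to pack the same token budget into different context lengths before pretraining, so that a short-context and a long-context run see identical data under the same compute. The tokenizer choice, corpus, and lengths are illustrative assumptions, not the paper's exact setup.

# Pack one token stream into fixed-length training sequences so that different
# context lengths can be compared under an identical token budget.
# Tokenizer, corpus, and sequence lengths are illustrative assumptions.
import torch
from transformers import AutoTokenizer

def make_blocks(texts, seq_len, tokenizer):
    """Concatenate tokenised texts and slice them into (N, seq_len) blocks."""
    ids = []
    for t in texts:
        ids.extend(tokenizer(t, add_special_tokens=False)["input_ids"])
    n_blocks = len(ids) // seq_len
    ids = ids[: n_blocks * seq_len]                   # drop the ragged tail
    return torch.tensor(ids).view(n_blocks, seq_len)  # same tokens, different shape

tokenizer = AutoTokenizer.from_pretrained("gpt2")
corpus = ["look at the doggy over there"] * 2000      # stand-in for the 100M-word corpus
short_ctx = make_blocks(corpus, seq_len=128, tokenizer=tokenizer)
long_ctx = make_blocks(corpus, seq_len=2048, tokenizer=tokenizer)
print(short_ctx.shape, long_ctx.shape)                # equal token counts, different windows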
BLiSS: Evaluating Bilingual Learner Competence in Second Language Small Language Models
Yuan Gao | Suchir Salhan | Andrew Caines | Paula Buttery | Weiwei Sun
Proceedings of the First BabyLM Workshop
Cross-lingual extensions of the BabyLM Shared Task beyond English incentivise the development of Small Language Models that simulate a much wider range of language acquisition scenarios, including code-switching, simultaneous and successive bilingualism, and second language acquisition. However, to our knowledge, there is no benchmark of the formal competence of cognitively-inspired models of L2 acquisition, or L2LMs. To address this, we introduce the Benchmark of Learner Interlingual Syntactic Structure (BLiSS). BLiSS consists of 1.5M naturalistic minimal pairs derived from errorful sentence–correction pairs in parallel learner corpora. These pairs capture systematic patterns of learner language that are overlooked by standard benchmarks of the formal competence of Language Models. We use them to evaluate L2LMs trained under a variety of training regimes on specific properties of L2 learner language, providing a linguistically-motivated framework for the controlled measurement of the interlanguage competence of L2LMs.
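As a concrete illustration of how such minimal pairs can be scored, the sketch below applies the standard forced-choice criterion: a model passes a pair if it assigns a higher log-probability to the corrected sentence than to the errorful learner sentence. The model name and the example pair are placeholders rather than BLiSS items.

# Score one errorful/corrected minimal pair: the model "passes" if it assigns a
# higher total log-probability to the corrected sentence.
# The model and the example pair are placeholders, not BLiSS items.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def sentence_logprob(sentence):
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)       # log P(token_t | tokens_<t)
    return logprobs.gather(-1, ids[:, 1:, None]).sum().item()  # sum over the sentence

errorful = "She go to school every day."
corrected = "She goes to school every day."
print(sentence_logprob(corrected) > sentence_logprob(errorful))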
Looking to Learn: Token-wise Dynamic Gating for Low-Resource Vision-Language Modelling
Bianca-Mihaela Ganescu | Suchir Salhan | Andrew Caines | Paula Buttery
Proceedings of the First BabyLM Workshop
Training vision-language models on cognitively-plausible amounts of data requires rethinking how models integrate multimodal information. Within the constraints of the Vision track for the BabyLM Challenge 2025, we propose a lightweight decoder-based architecture with (1) token-wise dynamic gating for adaptive fusion of linguistic and visual cues, (2) feature modulation and channel attention to maximise the utility of limited visual information, and (3) auxiliary contrastive objectives for visual grounding. Evaluation on five benchmarks (BLiMP, BLiMP Supplement, EWoK, Winoground and VQA) shows competitive or superior performance to multimodal baselines. More notably, our dynamic gate discovers interpretable patterns without explicit supervision, favouring visual cues for content words and linguistic cues for function words. While we identify limitations in the Challenge constraints, such as the information bottleneck created by global image embeddings and training instability from the dataset split, our findings establish dynamic gating as a powerful tool for efficient multimodal learning, offering both interpretability and performance even under severe constraints.
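The gating mechanism itself can be stated in a few lines. The sketch below is a minimal version of token-wise dynamic gating, assuming a global image embedding and a per-token scalar gate that mixes projected visual features with the linguistic hidden state; the dimensions and exact fusion form are assumptions, not the authors' implementation.

# Minimal token-wise dynamic gating: every token gets its own sigmoid gate that
# mixes its linguistic hidden state with a projected global image embedding.
# Dimensions and fusion form are assumptions, not the authors' implementation.
import torch
import torch.nn as nn

class TokenGate(nn.Module):
    def __init__(self, d_text, d_image):
        super().__init__()
        self.img_proj = nn.Linear(d_image, d_text)  # map the image embedding into text space
        self.gate = nn.Linear(2 * d_text, 1)        # one gate value per token

    def forward(self, text_h, image_emb):
        # text_h: (batch, seq, d_text); image_emb: (batch, d_image) global embedding
        img = self.img_proj(image_emb).unsqueeze(1).expand_as(text_h)
        g = torch.sigmoid(self.gate(torch.cat([text_h, img], dim=-1)))  # (batch, seq, 1)
        return g * img + (1 - g) * text_h  # lean visual for some tokens, linguistic for others

fused = TokenGate(d_text=512, d_image=768)(torch.randn(2, 16, 512), torch.randn(2, 768))
print(fused.shape)  # torch.Size([2, 16, 512])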
Teacher Demonstrations in a BabyLM’s Zone of Proximal Development for Contingent Multi-Turn Interaction
Suchir Salhan | Hongyi Gu | Donya Rooein | Diana Galvan-Sosa | Gabrielle Gaudeau | Andrew Caines | Zheng Yuan | Paula Buttery
Proceedings of the First BabyLM Workshop
Multi-turn dialogues between a child and caregiver are characterized by a property called contingency: prompt, direct, and meaningful exchanges between interlocutors. We introduce ContingentChat, a Teacher–Student framework that benchmarks and improves multi-turn contingency in a BabyLM trained on 100M words. When post-trained on a novel alignment dataset, the BabyLM generates responses that are more grammatical and cohesive. Experiments with adaptive Teacher decoding strategies show limited additional gains. ContingentChat highlights the benefits of targeted post-training for dialogue quality and presents contingency as a challenging goal for BabyLMs.
Pico: A Modular Framework for Hypothesis-Driven Small Language Model Research
Richard Diehl Martinez | David Demitri Africa | Yuval Weiss | Suchir Salhan | Ryan Daniels | Paula Buttery
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Building language models (LMs), especially small and medium ones, remains more art than science. While large LMs often improve by sheer scale, it is still unclear why many design choices work. For small LMs, this uncertainty is more limiting: tight parameter budgets make each decision critical, yet researchers still lack systematic, scientific ways to test and refine new ideas. We introduce Pico, a lightweight, modular framework that enables systematic, hypothesis-driven research for small and medium-scale language model development. Pico consists of two libraries that together provide a practical sandbox where researchers can make targeted changes to a model’s architecture or training procedures and directly observe their effects on the model’s behavior. To support reproducible experimentation, we also release a suite of baseline models, pico-decoder, trained under standardized conditions and open-sourced for the community. Case studies highlight how Pico can support iterative small LM design and analysis.
Meta-Pretraining for Zero-Shot Cross-Lingual Named Entity Recognition in Low-Resource Philippine Languages
David Demitri Africa | Suchir Salhan | Yuval Weiss | Paula Buttery | Richard Diehl Martinez
Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025)
Named-entity recognition (NER) in low-resource languages is usually tackled by finetuning very large multilingual LMs, an option that is often infeasible in memory- or latency-constrained settings. We ask whether small decoder LMs can be pretrained so that they adapt quickly and transfer zero-shot to languages unseen during pretraining. To this end we replace part of the autoregressive objective with first-order model-agnostic meta-learning (MAML). Tagalog and Cebuano are typologically similar yet structurally different in their actor/non-actor voice systems, and hence serve as a challenging test-bed. Across four model sizes (11M–570M), MAML lifts zero-shot micro-F1 by 2–6 pp under head-only tuning and 1–3 pp after full tuning, while cutting convergence time by up to 8%. Gains are largest for single-token person entities that co-occur with the Tagalog case particles si/ni, highlighting the importance of surface anchors.
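The first-order meta-learning step the abstract refers to can be sketched as follows, assuming a generic token-tagging loss: adapt a copy of the model on a support batch, then apply the query-loss gradients taken at the adapted weights directly to the original parameters. The model, batches, and learning rates are placeholders.

# One first-order MAML (FOMAML) step: inner-loop adaptation on a support batch,
# then an outer update that applies the query gradients (computed at the adapted
# weights) to the original parameters. All names here are placeholders.
import copy
import torch

def fomaml_step(model, support_batch, query_batch, loss_fn, inner_lr=1e-3, outer_lr=1e-4):
    adapted = copy.deepcopy(model)                       # inner loop works on a clone
    adapted.zero_grad()
    inner_opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
    loss_fn(adapted, support_batch).backward()
    inner_opt.step()

    adapted.zero_grad()                                  # outer loop: query loss at adapted weights
    loss_fn(adapted, query_batch).backward()
    with torch.no_grad():
        for p, p_adapted in zip(model.parameters(), adapted.parameters()):
            if p_adapted.grad is not None:
                p -= outer_lr * p_adapted.grad           # first-order approximation

# Toy usage with a linear tagger over random features (illustrative only).
tagger = torch.nn.Linear(32, 5)
loss = lambda m, b: torch.nn.functional.cross_entropy(m(b[0]), b[1])
batch = (torch.randn(8, 32), torch.randint(0, 5, (8,)))
fomaml_step(tagger, batch, batch, loss)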
Extended Abstract for “Linguistic Universals”: Emergent Shared Features in Independent Monolingual Language Models via Sparse Autoencoders
Ej Zhou | Suchir Salhan
Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025)
Do independently trained monolingual language models converge on shared linguistic principles? To explore this question, we propose to analyze a suite of models trained separately on single languages but with identical architectures and budgets. We train sparse autoencoders (SAEs) on model activations to obtain interpretable latent features, then align them across languages using activation correlations. We perform pairwise analyses to test whether feature spaces show non-trivial convergence, and we identify universal features that consistently emerge across diverse models. Positive results will provide evidence that certain high-level regularities in language are rediscovered independently in machine learning systems.
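The alignment step can be made concrete with a small sketch: given feature-activation matrices from two monolingual models' SAEs over a shared stimulus set, correlate the features and keep each feature's best-correlated partner in the other model. The shapes and correlation threshold are illustrative assumptions.

# Align SAE features across two monolingual models by correlating their
# activations over a shared stimulus set and keeping confidently matched pairs.
# Shapes and the 0.5 threshold are illustrative assumptions.
import numpy as np

def align_features(acts_a, acts_b, min_corr=0.5):
    # acts_a: (n_stimuli, n_features_a); acts_b: (n_stimuli, n_features_b)
    a = (acts_a - acts_a.mean(0)) / (acts_a.std(0) + 1e-8)
    b = (acts_b - acts_b.mean(0)) / (acts_b.std(0) + 1e-8)
    corr = a.T @ b / len(a)                  # Pearson correlation matrix
    best = corr.argmax(axis=1)               # best partner in model B for each A-feature
    return [(i, int(j), float(corr[i, j]))
            for i, j in enumerate(best) if corr[i, j] >= min_corr]

pairs = align_features(np.random.rand(1000, 64), np.random.rand(1000, 64))
print(len(pairs), "candidate shared features")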
2024
Less is More: Pre-Training Cross-Lingual Small-Scale Language Models with Cognitively-Plausible Curriculum Learning Strategies
Suchir Salhan | Richard Diehl Martinez | Zébulon Goriely | Paula Buttery
The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning
Curriculum Learning has been a popular strategy to improve the cognitive plausibility of Small-Scale Language Models (SSLMs) in the BabyLM Challenge. However, it has not led to considerable improvements over non-curriculum models. We assess whether linguistic acquisition theories can be used to specify more fine-grained curriculum learning strategies, creating age-ordered corpora of Child-Directed Speech for four typologically distant language families to implement SSLMs and acquisition-inspired curricula cross-lingually. Comparing the success of three objective curricula (Growing, Inwards & MMM) that precisely replicate the predictions of acquisition theories on a standard SSLM architecture, we find that fine-grained acquisition-inspired curricula can outperform non-curriculum baselines, and that the performance benefits of curriculum strategies in SSLMs can be obtained by specifying fine-grained, language-specific curricula that precisely replicate language acquisition theories.
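The scaffolding shared by all three objective curricula, an age-ordered stream of Child-Directed Speech split into successive training stages, can be sketched as follows; the record fields and stage count are illustrative assumptions rather than the paper's data format.

# Order child-directed utterances by the target child's age and split them into
# successive curriculum stages. Record fields and the stage count are
# illustrative assumptions, not the paper's data format.
from dataclasses import dataclass

@dataclass
class Utterance:
    text: str
    child_age_months: float

def age_ordered_curriculum(utterances, n_stages=3):
    """Return age-sorted utterances grouped into successive training stages."""
    ordered = sorted(utterances, key=lambda u: u.child_age_months)
    stage_size = max(1, len(ordered) // n_stages)
    return [ordered[i:i + stage_size] for i in range(0, len(ordered), stage_size)]

corpus = [Utterance("where did the ball go", 22.5),
          Utterance("look at the doggy", 14.0),
          Utterance("tell me what happened at school today", 48.0)]
for stage, utts in enumerate(age_ordered_curriculum(corpus), start=1):
    print(f"stage {stage}: {[u.text for u in utts]}")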