pdf
bib
Proceedings of the First BabyLM Workshop
Lucas Charpentier
|
Leshem Choshen
|
Ryan Cotterell
|
Mustafa Omer Gul
|
Michael Y. Hu
|
Jing Liu
|
Jaap Jumelet
|
Tal Linzen
|
Aaron Mueller
|
Candace Ross
|
Raj Sanjay Shah
|
Alex Warstadt
|
Ethan Gotlieb Wilcox
|
Adina Williams
pdf
bib
abs
Rethinking the Role of Text Complexity in Language Model Pretraining
Dan John Velasco
|
Matthew Theodore Roque
Improving pretraining data quality and size is known to boost downstream performance, but the role of text complexity—how hard a text is to read—remains less explored. We reduce surface-level complexity (shorter sentences, simpler words, simpler structure) while keeping core content approximately constant and ask: (i) How does complexity affect language modeling across model sizes? (ii) Can useful representations be learned from simpler text alone? (iii) How does pretraining text complexity influence downstream language understanding? We simplify human-written texts using a large language model, pretrain causal models (28M–500M) from scratch on original vs. simplified data, and evaluate them in fine-tuning and zero-shot setups. We find that perplexity is sensitive to the interaction between model capacity and text complexity—smaller models degrade far less on simpler texts—while text complexity has little impact on fine-tuning evaluations, with zero-shot evaluations indicating that simpler texts benefit performance on linguistic knowledge tasks, whereas more complex texts favor tasks requiring world knowledge and entity tracking. Our findings suggest that different types of data diversity affect transfer and zero-shot performance differently, providing insight into tailoring data curation to specific goals.
pdf
bib
abs
Contrastive Decoding for Synthetic Data Generation in Low-Resource Language Modeling
Jannek Ulm
|
Kevin Du
|
Vésteinn Snæbjarnarson
Large language models (LLMs) are trained on huge amounts of textual data, and concerns have been raised that the limits of such data may soon be reached. A potential solution is to train on synthetic data sampled from LLMs. In this work, we build on this idea and investigate the benefits of *contrastive decoding* for generating synthetic data. In a controlled setting, we experiment with sampling corpora using the relative difference between a GOOD and BAD model trained on the same original corpus of 100 million words. By amplifying the signal from a model that has better performance, we create a synthetic corpus and mix it with the original training data. Our findings show that training on a mixture of synthesized and real data improves performance on the language modeling objective and a range of downstream tasks. In particular, we see that training with a mix of synthetic data from contrastive decoding benefits tasks that require more *reasoning skills*, while synthetic data from traditional sampling helps more on tasks requiring surface-level *linguistic* capabilities.
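A minimal sketch of the contrastive sampling step described above, assuming two causal LMs loaded via Hugging Face transformers; the checkpoint names and the amplification weight alpha are illustrative placeholders, not the authors' exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoints standing in for the GOOD and BAD models.
good = AutoModelForCausalLM.from_pretrained("good-model")
bad = AutoModelForCausalLM.from_pretrained("bad-model")
tok = AutoTokenizer.from_pretrained("good-model")

@torch.no_grad()
def contrastive_sample(prompt, max_new_tokens=50, alpha=1.0):
    """Sample tokens from log p_good - alpha * log p_bad."""
    ids = tok(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        lp_good = good(ids).logits[:, -1].log_softmax(-1)
        lp_bad = bad(ids).logits[:, -1].log_softmax(-1)
        scores = lp_good - alpha * lp_bad            # amplify the GOOD model's signal
        next_id = torch.multinomial(scores.softmax(-1), num_samples=1)
        ids = torch.cat([ids, next_id], dim=-1)
        if tok.eos_token_id is not None and next_id.item() == tok.eos_token_id:
            break
    return tok.decode(ids[0], skip_special_tokens=True)
```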
pdf
bib
abs
Unifying Mixture of Experts and Multi-Head Latent Attention for Efficient Language Models
Sushant Mehta
|
Raj Dandekar
|
Rajat Dandekar
|
Sreedath Panat
We present MoE-MLA-RoPE, a novel architecture that combines Mixture of Experts (MoE) with Multi-head Latent Attention (MLA) and Rotary Position Embeddings (RoPE) for efficient small language models. Our approach addresses the fundamental trade-off between model capacity and computational efficiency through three key innovations: (1) fine-grained expert routing with 64 micro-experts and top-k selection, enabling flexible specialization through \binom{62}{6} ≈ 3.6 × 10^7 possible expert combinations; (2) shared expert isolation that dedicates 2 always-active experts for common patterns while routing to 6 of 62 specialized experts; and (3) gradient-conflict-free load balancing that maintains expert utilization without interfering with primary loss optimization. Extensive experiments on models ranging from 17M to 202M parameters demonstrate that MoE-MLA-RoPE with compression ratio r = d/2 achieves 68% KV cache memory reduction and 3.2× inference speedup while maintaining competitive perplexity (0.8% degradation). Compared to a 53.9M-parameter baseline, it improves validation loss by 6.9% over vanilla transformers while using 42% fewer active parameters per forward pass. FLOP-matched experiments reveal even larger gains: 11.1% improvement with 3.2× inference acceleration. Automated evaluation using GPT-4 as a judge confirms quality improvements in generation, with higher scores on coherence (8.1/10), creativity (7.9/10), and grammatical correctness (8.2/10). Our results establish that architectural synergy, not parameter scaling, defines the efficiency frontier for resource-constrained language model deployment.
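A minimal sketch of the fine-grained routing idea (2 always-active shared experts plus top-6-of-62 routed micro-experts); dimensions and gating details are illustrative, and MLA, RoPE, and the gradient-conflict-free load balancing are not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MicroExpertMoE(nn.Module):
    """2 shared (always-active) experts + top-k of 62 routed micro-experts."""
    def __init__(self, d_model=256, d_ff=512, n_routed=62, n_shared=2, top_k=6):
        super().__init__()
        def expert():
            return nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.shared = nn.ModuleList(expert() for _ in range(n_shared))
        self.routed = nn.ModuleList(expert() for _ in range(n_routed))
        self.router = nn.Linear(d_model, n_routed)
        self.top_k = top_k

    def forward(self, x):                            # x: (batch, seq, d_model)
        out = sum(e(x) for e in self.shared)         # common patterns: always-active experts
        logits = self.router(x)                      # (batch, seq, n_routed)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        sparse_gate = torch.zeros_like(logits).scatter_(-1, idx, weights)
        # Dense sketch: run every routed expert, then weight by the sparse gate.
        routed_out = torch.stack([e(x) for e in self.routed], dim=-1)   # (batch, seq, d_model, n_routed)
        return out + (routed_out * sparse_gate.unsqueeze(-2)).sum(-1)
```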
pdf
bib
abs
Are BabyLMs Deaf to Gricean Maxims? A Pragmatic Evaluation of Sample-efficient Language Models
Raha Askari
|
Sina Zarrieß
|
Özge Alacam
|
Judith Sieker
Implicit meanings are integral to human communication, making it essential for language models to be capable of identifying and interpreting them. Grice (1975) proposed a set of conversational maxims that guide cooperative dialogue, noting that speakers may deliberately violate these principles to express meanings beyond literal words, and that listeners, in turn, recognize such violations to draw pragmatic inferences. Building on Surian et al. (1996)’s study of children’s sensitivity to violations of Gricean maxims, we introduce a novel benchmark to test whether language models pretrained on <10M and <100M tokens can distinguish maxim-adhering from maxim-violating utterances. We compare these BabyLMs across five maxims and situate their performance relative to children and a Large Language Model (LLM) pretrained on 3T tokens. We find that overall, models trained on <100M tokens outperform those trained on <10M, yet fall short of child-level and LLM competence. Our results suggest that modest data increases improve some aspects of pragmatic behavior, leading to finer-grained differentiation between pragmatic dimensions.
pdf
bib
abs
Model Merging to Maintain Language-Only Performance in Developmentally Plausible Multimodal Models
Ece Takmaz
|
Lisa Bylinina
|
Jakub Dotlacil
State-of-the-art vision-and-language models consist of many parameters and learn from enormous datasets, surpassing the amounts of linguistic data that children are exposed to as they acquire a language. This paper presents our approach to the multimodal track of the BabyLM challenge addressing this discrepancy. We develop language-only and multimodal models in low-resource settings using developmentally plausible datasets, with our multimodal models outperforming previous BabyLM baselines. One finding in the multimodal language model literature is that these models tend to underperform in language-only tasks. Therefore, we focus on maintaining language-only abilities in multimodal models. To this end, we experiment with model merging, where we fuse the parameters of multimodal models with those of language-only models using weighted linear interpolation. Our results corroborate the findings that multimodal models underperform in language-only benchmarks that focus on grammar, and model merging with text-only models can help alleviate this problem to some extent, while maintaining multimodal performance.
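A minimal sketch of model merging by weighted linear interpolation over two architecture-compatible checkpoints; function and variable names are illustrative.

```python
import torch

def merge_state_dicts(multimodal_sd, text_only_sd, alpha=0.5):
    """Weighted linear interpolation of two compatible state dicts.

    alpha = 1.0 keeps the multimodal weights; alpha = 0.0 keeps the
    text-only weights. Parameters present in only one model are copied as-is.
    """
    merged = {}
    for name, p_mm in multimodal_sd.items():
        if name in text_only_sd and p_mm.shape == text_only_sd[name].shape:
            merged[name] = alpha * p_mm + (1.0 - alpha) * text_only_sd[name]
        else:
            merged[name] = p_mm.clone()
    return merged

# Usage, assuming two architecture-compatible models `mm_model` and `lm_model`:
# mm_model.load_state_dict(merge_state_dicts(mm_model.state_dict(),
#                                            lm_model.state_dict(), alpha=0.5))
```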
pdf
bib
abs
TafBERTa: Learning Grammatical Rules from Small-Scale Language Acquisition Data in Hebrew
Anita Gelboim
|
Elior Sulem
We present TafBERTa, a compact RoBERTa-based language model tailored for Hebrew child-directed speech (CDS). This work builds upon the BabyBERTa framework to address data scarcity and morphological complexity in Hebrew. Focusing on determiner-noun grammatical agreement phenomena, we show that TafBERTa achieves competitive performance compared to large-scale Hebrew language models while requiring significantly less data and computational resources. As part of this work, we also introduce a new corpus of Hebrew CDS, HTBerman, aligned with morphological metadata, and a new grammatical evaluation benchmark for Hebrew, HeCLiMP, based on minimal pairs. Our results demonstrate the effectiveness of TafBERTa in grammaticality judgments and its potential for efficient NLP in low-resource settings.
pdf
bib
abs
FORGETTER with forgetful hyperparameters and recurring sleeps can continue to learn beyond normal overfitting limits
Yamamoto Rui
|
Keiji Miura
LLMs suffer from considerable computational costs in training. A more biologically plausible curriculum learning may help to decrease the learning costs. Here we propose a FORGETTER training algorithm, in which a model forgets its optimization variables after a sleep, and the hyperparameters are set toward forgetting memory: rather large weight decay and learning rates as well as small but optimized batch sizes. By limiting the minGemma model to a 512-token input length and speeding up the development cycle, we compared the normal and FORGETTER learning algorithms using more than a thousand different models. Specifically, we found and utilized the “120-rule”: models with about 120 (query) heads in total, irrespective of the number of heads per layer, outperform others. The improvement from the FORGETTER algorithm is far bigger than that from optimizing the model structure. In particular, FORGETTER models can learn beyond the data size at which normal learning overfits. The FORGETTER also works for CIFAR10 image classification. These results suggest that forgetting can be beneficial for pretraining deep neural networks by avoiding overfitting.
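A rough sketch of the "sleep" idea as we read it: the optimizer's accumulated state is periodically discarded while the model weights are kept, under relatively large weight decay and learning rate. The interval and hyperparameter values below are illustrative placeholders, not the paper's tuned settings.

```python
from itertools import cycle
import torch

def train_with_sleeps(model, data_loader, loss_fn, steps_per_sleep=1000,
                      lr=3e-3, weight_decay=0.3, n_steps=10_000):
    """Re-create the optimizer at each 'sleep', forgetting its moment estimates."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    data_iter = cycle(data_loader)                  # assumes (batch, target) pairs
    for step in range(n_steps):
        batch, target = next(data_iter)
        loss = loss_fn(model(batch), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
        if (step + 1) % steps_per_sleep == 0:
            # Sleep: forget optimizer state (Adam moments), keep the learned weights.
            opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    return model
```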
pdf
bib
abs
Large Language Models and Children Have Different Learning Trajectories in Determiner Acquisition
Olivia La Fiandra
|
Nathalie Fernandez Echeverri
|
Patrick Shafto
|
Naomi H. Feldman
Large language models are often compared to human learners based on the amount of training data required or the end-state capabilities of a learner, yet less attention has been given to differences in their language learning process. This study uses determiner acquisition as a case study to characterize how LLMs and children differ in their learning processes. By analyzing annotated speech samples from specified age ranges of four children and intermediate training checkpoints of the Pythia-70m language model, we trace how each learner’s use of definite and indefinite determiners develops. Our results reveal a divergence: the children first produce the indefinite determiner, while the model first produces the definite determiner. This difference reflects underlying differences in the learning goals and mechanisms of models and children. Framing language learning as movement over distributions of linguistic features makes the learning process visible and offers an alternative approach for comparing humans and language models.
pdf
bib
abs
Design and Analysis of few Million Parameter Transformer-based Language Models trained over a few Million Tokens Dataset
Yen-Che Hsiao
|
Abhishek Dutta
In this work, we systematically explore training methods and perform hyperparameter tuning to identify key configurations for language models with parameter counts upper bounded by 28 million. These models are designed to generate a broad spectrum of basic general knowledge in simple and coherent English with limited generalization ability. We use the Simple English Wikipedia as the training dataset, selecting samples between 64 and 512 words, which provides a high-quality, compressed representation of general knowledge in basic English. Through hyperparameter tuning, we identify the best-performing architecture, yielding the lowest training loss, as a decoder-only Transformer with rotary positional encoding, multi-head attention, root-mean-square normalization, Gaussian error linear unit activation, post-normalization, no interleaved group query attention, an embedding dimension of 512, 8 layers, 8 attention heads, a feedforward dimension of 2048, and zero dropout. Models trained with a learning rate decaying linearly from 10^-4 to 10^-5 over 64 epochs achieve a training loss of 0.1, which appears sufficient for reproducing text more effectively than models trained to losses of 0.2 or 0.5. Fine-tuning on rephrased text further demonstrates that the model retains its ability to produce simple and coherent English covering broad basic knowledge, while exhibiting limited generalization capability.
pdf
bib
abs
What is the Best Sequence Length for BabyLM?
Suchir Salhan
|
Richard Diehl Martinez
|
Zebulon Goriely
|
Paula Buttery
Transformer language models typically operate with a fixed-length context window, which has grown in step with large-scale pretraining datasets. In the BabyLM Challenge, however, many past submissions have defaulted to using much shorter sequence lengths. We examine the impact of sequence length on BabyLM pretraining, to answer the simple question: what sequence length should we be using when training BabyLMs? Using 100M-word training data and fixed compute budgets, we compare 125M-parameter Mamba and OPT models, finding that although longer is often better, the optimal length depends on both task and architecture. Shorter sequences are sufficient for grammatical generalization tasks whereas longer contexts benefit morphological analogical reasoning tasks.
pdf
bib
abs
BitMar: Low-Bit Multimodal Fusion with Episodic Memory for Edge Devices
Euhid Aman
|
Esteban Carlin
|
Hsing-Kuo Kenneth Pao
|
Giovanni Beltrame
|
Ghaluh Indah Permata Sari
|
Yie-Tarng Chen
Cross-attention transformers and other multimodal vision-language models excel at grounding and generation; however, their extensive, full-precision backbones make it challenging to deploy them on edge devices. Memory-augmented architectures enhance the utilization of past context; however, most works rarely pair them with aggressive edge-oriented quantization. We introduce BitMar, a quantized multimodal transformer with an external human-like episodic memory for effective image-text generation on hardware with limited resources. BitMar utilizes 1.58-bit encoders, one for text (BitNet-style) and one for vision (DiNOv2-based), to create compact embeddings that are combined and used to query a fixed-size key-value episodic memory. During vector retrieval, the BitNet decoder applies per-layer conditioning, which increases the contextual relevance of generated content. The decoder also employs attention sinks with a sliding-window mechanism to process long or streaming inputs under tight memory budgets. The combination of per-layer conditioning and sliding-window attention achieves a strong quality-speed trade-off, delivering competitive captioning and multimodal understanding at low latency with a small model footprint. These characteristics make BitMar well-suited for edge deployment.
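A minimal sketch of a fixed-size key-value episodic memory read by similarity, as one might implement the retrieval step; sizes and the top-k cosine read are illustrative and omit the quantized encoders, per-layer conditioning, and attention sinks.

```python
import torch
import torch.nn.functional as F

class EpisodicMemory(torch.nn.Module):
    """Fixed-size key-value store read by cosine similarity."""
    def __init__(self, n_slots=512, d_key=256, d_value=256):
        super().__init__()
        self.keys = torch.nn.Parameter(torch.randn(n_slots, d_key) * 0.02)
        self.values = torch.nn.Parameter(torch.randn(n_slots, d_value) * 0.02)

    def read(self, query, top_k=8):                  # query: (batch, d_key)
        sims = F.normalize(query, dim=-1) @ F.normalize(self.keys, dim=-1).T
        weights, idx = sims.topk(top_k, dim=-1)      # attend only to the top slots
        weights = F.softmax(weights, dim=-1)
        return (weights.unsqueeze(-1) * self.values[idx]).sum(dim=1)

# Usage: fuse compact text and vision embeddings (e.g. with a hypothetical
# fusion MLP), then read the memory to condition the decoder:
# context = memory.read(fusion_mlp(torch.cat([txt_emb, img_emb], dim=-1)))
```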
pdf
bib
abs
Exploring smaller batch sizes for a high-performing BabyLM model architecture
Sharid Loáiciga
|
Eleni Fysikoudi
|
Asad B. Sayeed
We explore the conditions under which the highest-performing entry to the BabyLM task in 2023, Every Layer Counts BERT or ELC-BERT, is best-performing given more constrained resources than the original run, with a particular focus on batch size. ELC-BERT’s relative success, as an instance of model engineering compared to more cognitively-motivated architectures, could be taken as evidence that the “lowest-hanging” fruit is to be found from non-linguistic machine learning approaches. We find that if we take away the advantage of training time from ELC-BERT, the advantage of the architecture mostly disappears, but some hyperparameter combinations nevertheless differentiate themselves in performance.
pdf
bib
abs
BLiSS: Evaluating Bilingual Learner Competence in Second Language Small Language Models
Yuan Gao
|
Suchir Salhan
|
Andrew Caines
|
Paula Buttery
|
Weiwei Sun
Cross-lingual extensions of the BabyLM Shared Task beyond English incentivise the development of Small Language Models that simulate a much wider range of language acquisition scenarios, including code-switching, simultaneous and successive bilingualism, and second language acquisition. However, to our knowledge, there is no benchmark of the formal competence of cognitively-inspired models of L2 acquisition, or L2LMs. To address this, we introduce the Benchmark of Learner Interlingual Syntactic Structure (BLiSS). BLiSS consists of 1.5M naturalistic minimal pairs derived from errorful sentence–correction pairs in parallel learner corpora. These systematic patterns, overlooked by standard benchmarks of the formal competence of language models, allow us to evaluate L2LMs trained under a variety of regimes on specific properties of L2 learner language, providing a linguistically motivated framework for controlled measurement of the interlanguage competence of L2LMs.
pdf
bib
abs
Sample-Efficient Language Modeling with Linear Attention and Lightweight Enhancements
Patrick Haller
|
Jonas Golde
|
Alan Akbik
We study architectural and optimization techniques for sample-efficient language modeling under the constraints of the BabyLM 2025 shared task. Our model, BLaLM, replaces self-attention with a linear-time mLSTM token mixer and explores lightweight enhancements, including short convolutions, sliding window attention with dynamic modulation, and Hedgehog feature maps. To support training in low-resource settings, we curate a high-quality corpus emphasizing readability and pedagogical structure. Experiments across both strict and strict-small tracks show that (1) linear attention combined with sliding window attention consistently improves zero-shot performance, and (2) the Muon optimizer stabilizes convergence and reduces perplexity over AdamW. These results highlight effective strategies for efficient language modeling without relying on scale.
pdf
bib
abs
Looking to Learn: Token-wise Dynamic Gating for Low-Resource Vision-Language Modelling
Bianca-Mihaela Ganescu
|
Suchir Salhan
|
Andrew Caines
|
Paula Buttery
Training vision-language models on cognitively-plausible amounts of data requires rethinking how models integrate multimodal information. Within the constraints of the Vision track for the BabyLM Challenge 2025, we propose a lightweight decoder-based architecture with (1) token-wise dynamic gating for adaptive fusion of linguistic and visual cues, (2) feature modulation and channel attention to maximise the utility of limited visual information and (3) auxiliary contrastive objectives for visual grounding. Evaluation on five benchmarks (BLiMP, BLiMP Supplement, EWoK, Winoground and VQA) shows competitive or superior performance to multimodal baselines. More notably, our dynamic gate discovers interpretable patterns without explicit supervision, favouring visual cues for content words and linguistic cues for function words. While we identify limitations in the Challenge constraints, such as the information bottleneck created by global image embeddings and training instability from the dataset split, our findings establish dynamic gating as a powerful tool for efficient multimodal learning, offering both interpretability and performance even under severe constraints.
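A minimal sketch of token-wise dynamic gating over a global image embedding; dimensions and module names are illustrative, and the feature modulation, channel attention, and contrastive objectives are not shown.

```python
import torch
import torch.nn as nn

class TokenwiseGate(nn.Module):
    """Per-token fusion of linguistic and visual cues via a learned sigmoid gate."""
    def __init__(self, d_model=512, d_image=768):
        super().__init__()
        self.project_image = nn.Linear(d_image, d_model)   # map global image embedding to model dim
        self.gate = nn.Linear(2 * d_model, 1)

    def forward(self, text_hidden, image_embedding):
        # text_hidden: (batch, seq, d_model); image_embedding: (batch, d_image)
        img = self.project_image(image_embedding).unsqueeze(1).expand_as(text_hidden)
        g = torch.sigmoid(self.gate(torch.cat([text_hidden, img], dim=-1)))  # (batch, seq, 1)
        return g * img + (1.0 - g) * text_hidden   # g near 1: lean on vision; near 0: on text
```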
pdf
bib
abs
A Comparison of Elementary Baselines for BabyLM
Rareș Păpușoi
|
Sergiu Nisioi
This paper explores multiple simple baselines for the BabyLM challenge, covering random models, elementary predictions based on frequency, n-gram language models, LSTMs with several tokenizers (BPE, Unigram, SuperBPE), and GPT-BERT, the winning architecture from the prior BabyLM edition. The evaluation is focused on the BLiMP and BLiMP-Supplement benchmarks. Our experiments show that Strict-Small can sometimes outperform Strict, that performance can be highly sensitive to tokenization, and that data efficiency matters. A simple word-frequency baseline scored unexpectedly high, which led to identifying an evaluation artifact in the pipeline: a system that returns identical logits for both sentences in a minimal pair can achieve maximal accuracy.
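A small sketch of the reported artifact: if the minimal-pair comparison is non-strict, a degenerate system that assigns identical scores to both sentences is counted as correct on every pair. Names below are illustrative.

```python
def minimal_pair_accuracy(score_fn, pairs, strict=True):
    """pairs: list of (good_sentence, bad_sentence) tuples."""
    correct = 0
    for good, bad in pairs:
        s_good, s_bad = score_fn(good), score_fn(bad)
        # With strict=False, ties (e.g., a constant scorer) count as correct,
        # which is the evaluation artifact described above.
        correct += (s_good > s_bad) if strict else (s_good >= s_bad)
    return correct / len(pairs)

constant_scorer = lambda sentence: 0.0
pairs = [("the dog barks", "the dog bark"), ("she runs", "she run")]
print(minimal_pair_accuracy(constant_scorer, pairs, strict=False))  # 1.0 (spurious)
print(minimal_pair_accuracy(constant_scorer, pairs, strict=True))   # 0.0
```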
pdf
bib
abs
Two ways into the hall of mirrors: Language exposure and lossy memory drive cross-linguistic grammaticality illusions in language models
Kate McCurdy
|
Katharina Christian
|
Amelie Seyfried
|
Mikhail Sonkin
Readers of English, but not of Dutch or German, consistently show a grammaticality illusion: they find ungrammatical double-center-embedded sentences easier to process than corresponding grammatical sentences. If pre-trained language model (LM) surprisal mimics these cross-linguistic patterns, this implies that language statistics explain the effect; if, however, the illusion requires memory constraints such as lossy context surprisal (LCS), this suggests a critical role for memory. We evaluate LMs in Dutch, German, and English. We find that both factors influence LMs’ susceptibility to grammaticality illusions, and neither fully accounts for human-like processing patterns.
pdf
bib
abs
What did you say? Generating Child-Directed Speech Questions to Train LLMs
Whitney Poh
|
Michael Tombolini
|
Libby Barak
Child-Directed Speech (CDS) holds unique linguistic properties that distinguish it from other types of textual corpora. Language models trained using CDS often obtain superior results compared with models trained on the same amount of other types of data. Several studies have aimed at modifying non-CDS data to mimic the hypothesized advantageous linguistic properties of CDS. Here, we propose to adapt the non-CDS portions of the training data to include questions similar to those found in CDS interaction. We modify the data by adding artificially generated questions and methodically analyze the change in performance using each modified dataset. Our results show that artificial question generation strongly depends on the properties of the original dataset. While performance improves for question-related measures, overall performance is negatively affected as a result of the reduced syntactic diversity.
pdf
bib
abs
Beyond Repetition: Text Simplification and Curriculum Learning for Data-Constrained Pretraining
Matthew Theodore Roque
|
Dan John Velasco
Most language model pretraining studies assume large data volumes, leaving open how to improve pretraining in data-constrained settings beyond repeated exposure. In such settings, the effects of training data order and of including alternative versions of the same text remain underexplored. We address this by studying curriculum learning in pretraining, focusing on text-complexity ordering and data augmentation via simplification. We ask: (1) Does simplifying texts enhance representation quality more than reusing the original data?; and (2) Does ordering data by text complexity yield better representations? To answer, we simplify a high-quality English dataset using a large language model and test four data schedules: (1) repeated exposure, (2) low-to-high complexity, (3) high-to-low, and (4) interleaved. We analyze models’ representation quality from a sample-efficiency perspective via fine-tuning, as well as their zero-shot performance on linguistic knowledge, entity tracking, world knowledge, and commonsense reasoning. Our findings show that adding simplified data improves fine-tuning and zero-shot performance over the repeated-exposure baseline: smaller models benefit from low-to-high complexity, while larger models perform better with interleaved ordering.
pdf
bib
abs
CurLL: A Developmental Framework to Evaluate Continual Learning in Language Models
Pavan Kalyan Tankala
|
Shubhra Mishra
|
Satya Lokam
|
Navin Goyal
We introduce a comprehensive continual learning dataset and benchmark CurLL grounded in human developmental trajectories from ages 5–10, enabling systematic and fine-grained assessment of models’ ability to progressively acquire new skills. CurLL spans five developmental stages (0–4) covering ages 5–10, with a skill graph of 32 high-level skills, 128 sub-skills, 350+ goals, and 1,300+ indicators explicitly modeling prerequisite relationships. We generate a 23.4B-token synthetic dataset with controlled skill progression, vocabulary complexity, and format diversity, comprising paragraphs, comprehension-based QA (CQA), skill-testing QA (CSQA), and instruction–response (IR) pairs. Stage-wise token counts range from 2.12B to 6.78B tokens, supporting precise analysis of forgetting, forward transfer, and backward transfer. Using a 135M-parameter transformer trained under independent, joint, and sequential (continual) setups, we show trade-offs in skill retention and transfer efficiency. By mirroring human learning patterns and providing fine-grained control over skill dependencies, this work advances continual learning evaluations for language models.
pdf
bib
abs
A Morpheme-Aware Child-Inspired Language Model
Necva Bölücü
|
Burcu Can
Most tokenization methods in language models rely on subword units that lack explicit linguistic correspondence. In this work, we investigate the impact of using morpheme-based tokens in a small language model, comparing them to the widely used frequency-based method, BPE. We apply the morpheme-based tokenization method to both 10-million and 100-million word datasets from the BabyLM Challenge. Our results show that using a morphological tokenizer improves EWoK (basic world knowledge) performance by around 20% and entity tracking by around 40%, highlighting the impact of morphological information in developing smaller language models. We also apply curriculum learning, in which morphological information is gradually introduced during training, mirroring the vocabulary-building stage in infants that precedes morphological processing. The results are consistent with previous research: curriculum learning yields slight improvements for some tasks, but performance degradation in others.
pdf
bib
abs
Do Syntactic Categories Help in Developmentally Motivated Curriculum Learning for Language Models?
Arzu Burcu Güven
|
Anna Rogers
|
Rob Van Der Goot
We examine the syntactic properties of the BabyLM corpus and of age groups within CHILDES. While we find that CHILDES does not exhibit strong syntactic differentiation by age, we show that syntactic knowledge about the training data can be helpful in interpreting model performance on linguistic tasks. For curriculum learning, we explore developmental and several alternative cognitively inspired curriculum approaches. We find that some curricula help with reading tasks, but the main performance improvement comes from using the subset of syntactically categorizable data, rather than the full noisy corpus.
pdf
bib
abs
SlovakBabyLM: Replication of the BabyLM and Sample-efficient Pretraining for a Low-Resource Language
Ľuboš Kriš
|
Marek Suppa
In recent years, we can observe a trend of creating various language-specific language models (LMs) within the Slavic language family with the BERT architecture. However, as the number of LM parameters increases, a larger amount of text is required for good performance, which can hinder the development of LMs for specific languages. Our research looks for a solution in Curriculum Learning (CL) methods, which can help us build better models with a lower amount of text in comparison with current LMs and thus support better pretraining of models for low-resource languages (LRLs). Therefore, we replicate the BabyLM Challenge in the Slovak language (Dataset: https://huggingface.co/datasets/ubokri/SlovakBabyLM, Code: https://github.com/baucek/Slovakbabylm/tree/main). Additionally, we apply CL to compare the effects of CL methods on English and Slovak and to evaluate whether CL improves LM performance. Our experiments show that the use of CL methods as preprocessing methods significantly improves model performance in sentiment analysis and question answering.
pdf
bib
abs
Single layer tiny Co4 outpaces GPT-2 and GPT-BERT
Noor Ul Zain
|
Mohsin Raza Naseem
|
Ahsan Adeel
We show that a tiny Co4 machine (CITATION) with a single layer, two heads, and 8M parameters, operating at O(N) computational cost (where N is the number of input tokens), in just 2 epochs outpaces GPT-2 (124M, 12 layers, O(N^2)) and GPT-BERT (30M, 12 layers, O(N^2)), both trained for 10 epochs. Co4 achieves orders-of-magnitude greater training efficiency on 10M tokens, demonstrating sample-efficient pretraining. On the BabyLM challenge evaluation pipeline, Co4 performs comparably or better across complex benchmarks, showing strong zero-shot and fine-tuning performance on SuperGLUE tasks. Specifically, Co4 outperforms GPT-2 in 5 out of 7 zero-shot metrics and 6 out of 7 fine-tuning tasks, and GPT-BERT in 4 out of 7 metrics in both cases. These results strongly suggest a need to rethink prevailing deep learning paradigms and associated scaling laws.
pdf
bib
abs
Teacher Demonstrations in a BabyLM’s Zone of Proximal Development for Contingent Multi-Turn Interaction
Suchir Salhan
|
Hongyi Gu
|
Donya Rooein
|
Diana Galvan-Sosa
|
Gabrielle Gaudeau
|
Andrew Caines
|
Zheng Yuan
|
Paula Buttery
Multi-turn dialogues between a child and caregiver are characterized by a property called contingency – prompt, direct, and meaningful exchanges between interlocutors. We introduce ContingentChat, a Teacher–Student framework that benchmarks and improves multi-turn contingency in a BabyLM trained on 100M words. Using a novel alignment dataset for post-training, BabyLM generates responses that are more grammatical and cohesive. Experiments with adaptive Teacher decoding strategies show limited additional gains. ContingentChat highlights the positive benefits of targeted post-training on dialogue quality and presents contingency as a challenging goal for BabyLMs.
pdf
bib
abs
Influence-driven Curriculum Learning for Pre-training on Limited Data
Loris Schoenegger
|
Lukas Thoma
|
Terra Blevins
|
Benjamin Roth
Curriculum learning, a training technique where data is presented to the model in order of example difficulty (e.g., from simpler to more complex documents), has shown limited success for pre-training language models. In this work, we investigate whether curriculum learning becomes competitive if we replace conventional human-centered difficulty metrics with one that more closely corresponds to example difficulty as observed during model training. Specifically, we experiment with sorting training examples by their training data influence, a score which estimates the effect of individual training examples on the model’s output. Models trained on our curricula are able to outperform ones trained in random order by over 10 percentage points in benchmarks, confirming that curriculum learning is beneficial for language model pre-training, as long as a more model-centric notion of difficulty is adopted.
pdf
bib
abs
Understanding and Enhancing Mamba-Transformer Hybrids for Memory Recall and Language Modeling
Hyunji Lee
|
Wenhao Yu
|
Hongming Zhang
|
Kaixin Ma
|
Jiyeon Kim
|
Dong Yu
|
Minjoon Seo
Hybrid models that combine state space models (SSMs) with attention mechanisms have demonstrated strong performance by leveraging the efficiency of SSMs and the high recall ability of attention. However, the underlying reasons for these benefits remain insufficiently understood. In this work, we investigate hybrid architectures through the lens of memory utilization and overall performance, and propose a complementary method to further enhance their effectiveness. We focus in particular on the distinction between sequential and parallel integration of SSM and attention layers. Our analysis reveals that sequential hybrids perform better on shorter contexts, whereas parallel hybrids are more effective for longer contexts. Among various configurations, parallel hybrids using a cross-attention to combine SSM and attention outputs perform best. We also introduce a data-centric approach to further improve model performance: continual training on datasets with paraphrases. This method strikes the best balance across various other datasets, enhancing memory recall while preserving other capabilities. It generalizes well across different base models, including pure SSMs, and outperforms architectural modifications aimed at enhancing recall.
pdf
bib
abs
Findings of the Third BabyLM Challenge: Accelerating Language Modeling Research with Cognitively Plausible Data
Lucas Charpentier
|
Leshem Choshen
|
Ryan Cotterell
|
Mustafa Omer Gul
|
Michael Y. Hu
|
Jing Liu
|
Jaap Jumelet
|
Tal Linzen
|
Aaron Mueller
|
Candace Ross
|
Raj Sanjay Shah
|
Alex Warstadt
|
Ethan Gotlieb Wilcox
|
Adina Williams
This report summarizes the findings from the 3rd BabyLM Challenge and the 1st BabyLM Workshop. The BabyLM Challenge is a shared task aimed at closing the data efficiency gap between human and machine language learners. The goal is to improve the performance of language models given a fixed training budget of no more than 100 million words. This year, the challenge was held as part of an expanded BabyLM Workshop that invited paper submissions on topics relevant to the BabyLM effort, including sample-efficient pretraining and cognitive modeling for LMs. For the challenge, we kept the text-only and text–image tracks from previous years, but also introduced a new interaction track, where student models are allowed to learn from feedback from larger teacher models. Furthermore, we introduce a new set of evaluation tasks to assess the “human likeness” of models on a cognitive and linguistic level, limit the total amount of training compute allowed, and measure performance on intermediate checkpoints. We observe that new training objectives and architectures tend to produce the best-performing approaches, and that interaction with teacher models can yield high-quality language models. The strict and interaction tracks saw submissions that outperformed the best-performing methods from previous years. We do not observe a complete correlation between training FLOPs and performance, suggesting that some methods can produce real gains beyond allowing us to spend more compute. This year’s BabyLM Challenge shows that there is still room to innovate in a data-constrained setting, and that community-driven research can yield actionable insights for language modeling.
pdf
bib
abs
Dialogue Is Not Enough to Make a Communicative BabyLM (But Neither Is Developmentally Inspired Reinforcement Learning)
Francesca Padovani
|
Bastian Bunzeck
|
Manar Ali
|
Omar Momen
|
Arianna Bisazza
|
Hendrik Buschmeier
|
Sina Zarrieß
We investigate whether pre-training exclusively on dialogue data results in formally and functionally apt small language models. Based on this pre-trained llamalogue model, we employ a variety of fine-tuning strategies to enforce “more communicative” text generations by our models. Although our models underperform on most standard BabyLM benchmarks, they excel at dialogue continuation prediction in a minimal pair setting. While PPO fine-tuning has mixed to adversarial effects on our models, DPO fine-tuning further improves their performance on our custom dialogue benchmark.
pdf
bib
abs
CLASS-IT: Conversational and Lecture-Aligned Small-Scale Instruction Tuning for BabyLMs
Luca Capone
|
Alessandro Bondielli
|
Alessandro Lenci
This work investigates whether small-scale LMs can benefit from instruction tuning (IT). We compare conversational and question–answering IT datasets, applied either in a merged or sequential curriculum, using decoder-only models with 100M and 140M parameters. Evaluation spans both fine-tuning (SuperGLUE) and zero-shot (BLiMP, EWoK, WUGs, entity tracking, and psycholinguistic correlation) settings. Results show that IT yields small but consistent gains in fine-tuning scenarios, with sequential curricula outperforming merged data; however, improvements do not consistently transfer to zero-shot tasks, suggesting a trade-off between interaction-focused adaptation and broad linguistic generalization. These results highlight both the potential and the constraints of adapting human-inspired learning strategies to low-resource LMs, and point toward hybrid, curriculum-based approaches for enhancing generalization under ecological training limits.
pdf
bib
abs
Mask and You Shall Receive: Optimizing Masked Language Modeling For Pretraining BabyLMs
Lukas Edman
|
Alexander Fraser
We describe our strategy for the 2025 edition of the BabyLM Challenge. Our main contribution is an improved form of Masked Language Modeling (MLM), which adapts the masking probabilities of tokens according to the model’s ability to predict them. The results show a substantial increase in performance on (Super)GLUE tasks over standard MLM. We also incorporate sub-token embeddings, finding that this increases the model’s morphological generalization capabilities. Our submission beats the baseline in the strict-small track.
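A minimal sketch of adaptive masking as we read it: per-token masking probabilities that rise for tokens the model predicts poorly and fall for tokens it predicts well. The update rule and constants are illustrative, not the authors' exact formulation.

```python
import torch

class AdaptiveMasker:
    """Keep a per-vocabulary-token masking probability, updated from prediction loss."""
    def __init__(self, vocab_size, base_p=0.15, lr=0.01, p_min=0.05, p_max=0.5):
        self.p = torch.full((vocab_size,), base_p)
        self.lr, self.p_min, self.p_max = lr, p_min, p_max

    def sample_mask(self, input_ids):
        # Bernoulli mask drawn from each token's current masking probability.
        return torch.bernoulli(self.p[input_ids]).bool()

    def update(self, input_ids, token_losses):
        # Tokens with above-average loss get masked more often next time.
        delta = self.lr * (token_losses - token_losses.mean())
        self.p.index_add_(0, input_ids.flatten(), delta.flatten())
        self.p.clamp_(self.p_min, self.p_max)
```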
pdf
bib
abs
Once Upon a Time: Interactive Learning for Storytelling with Small Language Models
Jonas Mayer Martins
|
Ali Hamza Bashir
|
Muhammad Rehan Khalid
|
Lisa Beinborn
Children efficiently acquire language not just by listening, but by interacting with others in their social environment. Conversely, large language models are typically trained with next-word prediction on massive amounts of text. Motivated by this contrast, we investigate whether language models can be trained with less data by learning not only from next-word prediction but also from high-level, cognitively inspired feedback. We train a student model to generate stories, which a teacher model rates on readability, narrative coherence, and creativity. By varying the amount of pretraining before the feedback loop, we assess the impact of this interactive learning on formal and functional linguistic competence. We find that the high-level feedback is highly data efficient: With just 1M words of input in interactive learning, storytelling skills can improve as much as with 410M words of next-word prediction.
pdf
bib
abs
You are an LLM teaching a smaller model everything you know: Multi-task pretraining of language models with LLM-designed study plans
Wiktor Kamzela
|
Mateusz Lango
|
Ondrej Dusek
This paper proposes a multi-task pre-training of language models without any text corpora. The method leverages an existing Large Language Model (LLM) to generate a diverse corpus containing training data for 56 automatically designed tasks and uses generated labels to enhance the training signal. The method does not rely on hidden states or even output distributions of the teacher model, so it may be employed in scenarios where the teacher LLM is available only through an API. The conducted experiments show that models trained on the proposed synthetic corpora achieve competitive or superior performance compared to those trained on same-sized human-written texts.
pdf
bib
abs
Active Curriculum Language Modeling over a Hybrid Pre-training Method
Eleni Fysikoudi
|
Sharid Loáiciga
|
Asad B. Sayeed
We apply the Active Curriculum Language Modeling (ACLM) method to the constrained pretraining setting of the 2025 BabyLM Challenge, where models are limited by both data and compute budgets. Using GPT-BERT (Charpentier and Samuel, 2024) as the base architecture, we investigate the impact of surprisal-based example selection for constructing a training curriculum. In addition, we conduct a targeted hyperparameter search over tokenizer size and batch size. Our approach yields stable pretrained models that surpass the official baseline on multiple evaluation tasks, demonstrating ACLM’s potential for improving performance and generalization in low-resource pretraining scenarios.
pdf
bib
abs
Linguistic Units as Tokens: Intrinsic and Extrinsic Evaluation with BabyLM
Achille Fusco
|
Maria Letizia Piccini Bianchessi
|
Tommaso Sgrizzi
|
Asya Zanollo
|
Cristiano Chesi
Tokenization is often treated as a preprocessing step, yet in data-limited settings it directly shapes what a model can learn. We compare four segmentation strategies in the BabyLM Challenge: frequency-based BPE, morphology-aware MorPiece and ParadigmFinder, and syllable-based SylliTok. Evaluation combines two perspectives. First, an intrinsic test on the SIGMORPHON 2022 segmentation benchmark, adapted to English, measures how closely each tokenizer aligns with morpheme boundaries. Second, extrinsic tests train GPT-2 on the 10M BabyLM corpus and evaluate on the 2025 benchmark. No single tokenizer dominates. BPE remains strong on syntax-heavy tasks. ParadigmFinder excels in semantic composition and age-of-acquisition alignment. MorPiece shows advantages in discourse tracking. Morphology-aware tokenizers achieve the best intrinsic segmentation scores, and these gains translate into more robust generalisation in comprehension tasks. These results highlight tokenization as a core modeling decision, with direct consequences for compression, morphology, and the path to humanlike learning.
pdf
bib
abs
Batch-wise Convergent Pre-training: Step-by-Step Learning Inspired by Child Language Development
Ko Yoshida
|
Daiki Shiono
|
Kai Sato
|
Toko Miura
|
Momoka Furuhashi
|
Jun Suzuki
Human children acquire language from a substantially smaller amount of linguistic input than that typically required for training large language models (LLMs). This gap motivates the search for more efficient pre-training methods. Inspired by child development, curriculum learning, which progresses from simple to complex data, has been widely adopted. In this study, we propose a pre-training framework that mirrors child language acquisition, advancing step by step from words to sentences while retaining prior knowledge. We investigate whether this improves retention and efficiency under limited resources. Our approach is implemented through four components: (i) a curriculum-aligned dataset, (ii) a batch-wise convergence loop, (iii) a distance-controlled loss to mitigate forgetting, and (iv) a constraint-controlled optimizer for stability. Experiments on the BabyLM benchmark show that the proposed method performs slightly below the official baselines in overall accuracy, with larger gaps on grammar-oriented evaluations such as BLiMP. Nonetheless, it yields small but consistent gains on morphology- and discourse-related tasks (e.g., WUG-ADJ, Entity Tracking), suggesting that the approach affects different linguistic aspects unevenly under limited data conditions.
pdf
bib
abs
Pretraining Language Models with LoRA and Artificial Languages
Nalin Kumar
|
Mateusz Lango
|
Ondrej Dusek
Large language models (LLMs) require a substantial amount of training data, which contrasts with the data-efficient learning observed in humans. In our submission to the BabyLM Challenge, we address this disparity by proposing a parameter-efficient pretraining approach for language acquisition from limited data. Our approach involves initializing the model with token embeddings trained by a shallow model, followed by tuning the non-embedding parameters with non-linguistic data to introduce structural biases. Then, we freeze the resulting model and pretrain it on the 10M-token BabyLM corpus using LoRA adapters. Experiments on small corpora demonstrate that our approach improves upon classic pretraining of the entire model.
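A minimal sketch of the final stage, assuming the Hugging Face peft library: the base model is frozen and only LoRA adapters are trained. The checkpoint name, target modules, and ranks are placeholders.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("small-babylm-base")  # hypothetical checkpoint
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # placeholder attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)       # base weights are frozen; only adapters train
model.print_trainable_parameters()
# ...then run a standard causal-LM training loop on the 10M-token BabyLM corpus.
```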
pdf
bib
abs
Masked Diffusion Language Models with Frequency-Informed Training
Despoina Kosmopoulou
|
Efthymios Georgiou
|
Vaggelis Dorovatas
|
Georgios Paraskevopoulos
|
Alexandros Potamianos
We present a masked diffusion language modeling framework for data-efficient training for the BabyLM 2025 Challenge. Our approach applies diffusion training objectives to language modeling under strict data constraints, incorporating frequency-informed masking that prioritizes learning from rare tokens while maintaining theoretical validity. We explore multiple noise scheduling strategies, including two-mode approaches, and investigate different noise weighting schemes within the NELBO objective. We evaluate our method on the BabyLM benchmark suite, measuring linguistic competence, world knowledge, and human-likeness. Results show performance competitive to hybrid autoregressive-masked baselines, demonstrating that diffusion-based training offers a viable alternative for data-restricted language learning.
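A minimal sketch of frequency-informed masking: rarer tokens receive a higher masking probability, scaled around a base rate taken from the diffusion noise schedule. Corpus counts, the temperature, and the clipping bounds are illustrative.

```python
import torch

def frequency_informed_mask(input_ids, token_counts, base_rate, temperature=0.5,
                            p_min=0.05, p_max=0.95):
    """Mask rare tokens more aggressively than frequent ones.

    input_ids: (batch, seq) token ids; token_counts: (vocab,) corpus frequencies;
    base_rate: scalar masking rate for this diffusion step.
    """
    freqs = token_counts.float().clamp(min=1)
    rarity = (1.0 / freqs) ** temperature          # higher for rare tokens
    rarity = rarity / rarity.mean()                # keep the average rate near base_rate
    probs = (base_rate * rarity[input_ids]).clamp(p_min, p_max)
    return torch.bernoulli(probs).bool()           # True = replace with [MASK]
```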
pdf
bib
abs
MoEP: Modular Expert Paths for Sample-Efficient Language Modeling
Joonas Tapaninaho
Training language models under tight compute budgets with small training datasets remains challenging for dense decoder-only Transformers, where every token activates the full stack of model parameters. We introduce MoEP (Modular Expert Paths), a sparse decoder-only architecture that enables more selective token activation, which increases model performance and accelerates learning without increasing the total number of parameters. We show that combining model parallelism with Mixture-of-Experts (MoE) style linear projections and a lightweight top-k router outperforms the GPT-2 baseline and stabilizes evaluation performance more quickly.
pdf
bib
abs
RecombiText: Compositional Data Augmentation for Enhancing LLM Pre-Training Datasets in Low-Resource Scenarios
Alexander Tampier
|
Lukas Thoma
|
Loris Schoenegger
|
Benjamin Roth
We introduce RecombiText Augmentation (RTA), a novel purely statistical NLP method for compositional data augmentation for data-efficient LLM pre-training in low-resource scenarios. RTA identifies lexically and semantically similar sentences within the corpus and generates synthetic sentence pairs from them while preserving underlying patterns from the corpus. We pre-train GPT-2 and RoBERTa language models on a domain-specific, low-resource corpus of 10 million words, with different proportions of augmented data. We compare our RTA-augmented model variants to a baseline model trained on the full original dataset. Zero-shot results show that the language models pre-trained on synthetic data improve in entity tracking, self-paced reading, and morphological generalization benchmarks. In other tasks, the performance is comparable to the baseline model. We demonstrate that it is possible to expand low-resource datasets by two- to four-fold without compromising benchmark performance, solely through statistical processing of the available data.