Single layer tiny Co4 outpaces GPT-2 and GPT-BERT
Noor Ul Zain | Mohsin Raza Naseem | Ahsan Adeel
Proceedings of the First BabyLM Workshop, 2025
We show that a tiny Co4 machine (CITATION) with a single layer, two heads, and 8M parameters, operating at O(N) computational cost (where N is the number of input tokens), outpaces in just 2 epochs both GPT-2 (124M parameters, 12 layers, O(N²)) and GPT-BERT (30M parameters, 12 layers, O(N²)), each trained for 10 epochs. Co4 achieves orders-of-magnitude greater training efficiency on 10M tokens, demonstrating sample-efficient pretraining. On the BabyLM Challenge evaluation pipeline, Co4 performs comparably to or better than these models across complex benchmarks, showing strong zero-shot and fine-tuning performance on SuperGLUE tasks. Specifically, Co4 outperforms GPT-2 on 5 of 7 zero-shot metrics and 6 of 7 fine-tuning tasks, and outperforms GPT-BERT on 4 of 7 metrics in both settings. These results strongly suggest a need to rethink prevailing deep learning paradigms and the associated scaling laws.
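To make the asymptotic contrast concrete, the sketch below counts token-mixing interactions per forward pass for a linear-cost, single-layer model versus a quadratic-cost, 12-layer model. The context length and the cost model are assumptions for illustration only; this is back-of-the-envelope arithmetic, not the paper's FLOP accounting or the Co4 architecture itself.

```python
# Illustrative arithmetic only: proxy for per-forward-pass token-mixing cost.
# Linear O(N) scaling (as claimed for Co4, 1 layer) vs. quadratic O(N^2)
# scaling of standard self-attention (12 layers, as in GPT-2 / GPT-BERT).
# N = 1024 is an assumed context length, not a figure from the paper.

def mixing_cost(n_tokens: int, n_layers: int, quadratic: bool) -> int:
    """Count token-interaction operations across layers (unitless proxy)."""
    per_layer = n_tokens ** 2 if quadratic else n_tokens
    return n_layers * per_layer

N = 1024
linear_1_layer = mixing_cost(N, n_layers=1, quadratic=False)
quadratic_12_layers = mixing_cost(N, n_layers=12, quadratic=True)

print(f"linear,    1 layer : {linear_1_layer:>12,}")
print(f"quadratic, 12 layers: {quadratic_12_layers:>12,}")
print(f"ratio: {quadratic_12_layers / linear_1_layer:,.0f}x")
```

Under these assumptions the quadratic, 12-layer configuration performs roughly four orders of magnitude more mixing operations per pass, which is the kind of gap the abstract's efficiency claim alludes to.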