Single layer tiny Co4 outpaces GPT-2 and GPT-BERT

Noor Ul Zain, Mohsin Raza Naseem, Ahsan Adeel


Abstract
We show that a tiny Co4 machine (CITATION) with a single layer, two heads, and 8M parameters, operating at O(N) computational cost (where N is the number of input tokens), outpaces in just 2 epochs both GPT-2 (124M, 12 layers, O(N²)) and GPT-BERT (30M, 12 layers, O(N²)), each trained for 10 epochs. Co4 achieves orders-of-magnitude greater training efficiency on 10M tokens, demonstrating sample-efficient pretraining. On the BabyLM challenge evaluation pipeline, Co4 performs comparably or better across complex benchmarks, showing strong zero-shot and fine-tuning performance on SuperGLUE tasks. Specifically, Co4 outperforms GPT-2 on 5 out of 7 zero-shot metrics and 6 out of 7 fine-tuning tasks, and GPT-BERT on 4 out of 7 metrics in both settings. These results strongly suggest a need to rethink prevailing deep learning paradigms and associated scaling laws.
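To make the asymptotic contrast in the abstract concrete, here is a small back-of-the-envelope calculation; it is our illustration, not the authors', and the context length N = 1024 is a hypothetical value.

\[
\underbrace{N^{2} = 1024^{2} \approx 1.05\times 10^{6}}_{\text{token pairs, quadratic attention}}
\qquad\text{vs.}\qquad
\underbrace{N = 1024}_{\text{tokens, linear-cost mechanism}}
\]

At this (hypothetical) length the per-layer cost differs by a factor of roughly N ≈ 10³, and the gap widens linearly as sequences grow.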
Anthology ID: 2025.babylm-main.24
Volume: Proceedings of the First BabyLM Workshop
Month: November
Year: 2025
Address: Suzhou, China
Editors: Lucas Charpentier, Leshem Choshen, Ryan Cotterell, Mustafa Omer Gul, Michael Y. Hu, Jing Liu, Jaap Jumelet, Tal Linzen, Aaron Mueller, Candace Ross, Raj Sanjay Shah, Alex Warstadt, Ethan Gotlieb Wilcox, Adina Williams
Venue: BabyLM
Publisher: Association for Computational Linguistics
Pages: 313–322
URL: https://preview.aclanthology.org/ingest-emnlp/2025.babylm-main.24/
Cite (ACL): Noor Ul Zain, Mohsin Raza Naseem, and Ahsan Adeel. 2025. Single layer tiny Co4 outpaces GPT-2 and GPT-BERT. In Proceedings of the First BabyLM Workshop, pages 313–322, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Single layer tiny Co4 outpaces GPT-2 and GPT-BERT (Zain et al., BabyLM 2025)
PDF: https://preview.aclanthology.org/ingest-emnlp/2025.babylm-main.24.pdf