Joonas Tapaninaho

2025

MoEP: Modular Expert Paths for Sample-Efficient Language Modeling
Joonas Tapaninaho
Proceedings of the First BabyLM Workshop

Training language models under tight compute budgets with small training datasets remains challenging for dense decoder-only Transformers, where every token activates the full stack of model parameters. We introduce MoEP (Modular Expert Paths), a sparse decoder-only architecture that enables more selective token activation, which increases model performance and accelerates learning without increasing the total number of parameters. We show that combining model parallelism with Mixture-of-Experts (MoE) style linear projections and a lightweight top-k router outperforms the GPT-2 baseline and stabilizes evaluation performance more quickly.
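To make the routing idea in the abstract concrete, here is a minimal sketch of a top-k router sitting in front of a set of expert linear projections. It is not the MoEP implementation: the function names (`top_k_route`, `moe_linear`), the toy dimensions, and the choice to renormalize the router scores with a softmax over only the selected experts are all assumptions made for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def top_k_route(router_W, x, k):
    """Return [(expert_index, weight)] for the k highest-scoring experts.

    Assumption: weights are a softmax over the selected logits only.
    """
    logits = matvec(router_W, x)
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in top])
    return list(zip(top, weights))


def moe_linear(experts, router_W, x, k=2):
    """Sparse linear projection: only the k routed experts are evaluated."""
    routes = top_k_route(router_W, x, k)
    out = [0.0] * len(experts[0])
    for idx, weight in routes:
        y = matvec(experts[idx], x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, routes


def random_matrix(rows, cols, rng):
    """Hypothetical helper: small Gaussian-initialized matrix."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


# Toy setup (all sizes are illustrative, not taken from the paper).
rng = random.Random(0)
d_in, d_out, n_experts = 4, 3, 4
experts = [random_matrix(d_out, d_in, rng) for _ in range(n_experts)]
router_W = random_matrix(n_experts, d_in, rng)

token = [0.5, -1.0, 0.25, 2.0]
output, routes = moe_linear(experts, router_W, token, k=2)
```

With `k=2` and four experts, each token touches only half of the expert parameters in this layer, while the total parameter count stays the same as if all four projections were stored densely.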


Venues

babylm1
