Pre-Training Curriculum for Multi-Token Prediction in Language Models

Ansar Aynetdinov; Alan Akbik

Abstract

Multi-token prediction (MTP) is a recently proposed pre-training objective for language models. Rather than predicting only the next token, as in next-token prediction (NTP), MTP predicts the next *k* tokens at each prediction step, using multiple prediction heads. MTP has shown promise in improving downstream performance, inference speed, and training efficiency, particularly for large models. However, prior work has shown that smaller language models (SLMs) struggle with the MTP objective. To address this, we propose a curriculum learning strategy for MTP training, exploring two variants: a forward curriculum, which gradually increases the complexity of the pre-training objective from NTP to MTP, and a reverse curriculum, which does the opposite. Our experiments show that the forward curriculum enables SLMs to better leverage the MTP objective during pre-training, improving downstream NTP performance and generative output quality, while retaining the benefits of self-speculative decoding. The reverse curriculum achieves stronger NTP performance and output quality, but fails to provide any self-speculative decoding benefits.
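
The two curricula can be illustrated with a minimal sketch of a schedule that decides how many prediction heads contribute to the loss at a given training step. The abstract does not specify the schedule, so the equal-length stages used here are an assumption, and the function name `active_heads` and its parameters are hypothetical rather than taken from the paper.

```python
def active_heads(step: int, total_steps: int, k: int, reverse: bool = False) -> int:
    """Return how many prediction heads are trained at `step`.

    Assumption (not from the paper): training is split into k
    equal-length stages. The forward curriculum starts with one head
    (plain NTP) and adds one head per stage until all k are active.
    The reverse curriculum starts with k heads and drops one per
    stage until only the NTP head remains.
    """
    if k < 1 or total_steps < 1:
        raise ValueError("k and total_steps must be positive")
    if not 0 <= step < total_steps:
        raise ValueError("step must lie in [0, total_steps)")
    stage = step * k // total_steps  # stage index in 0 .. k-1
    return k - stage if reverse else stage + 1


if __name__ == "__main__":
    # Print the number of active heads at a few steps for k = 4.
    for s in (0, 25, 50, 75, 99):
        print(s, active_heads(s, 100, 4), active_heads(s, 100, 4, reverse=True))
```

Under this assumed schedule, a forward run ends with all *k* heads trained, whereas a reverse run ends with only the NTP head active.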

Anthology ID: 2025.acl-long.1243
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 25573–25588
URL: https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.1243/
Cite (ACL): Ansar Aynetdinov and Alan Akbik. 2025. Pre-Training Curriculum for Multi-Token Prediction in Language Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 25573–25588, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Pre-Training Curriculum for Multi-Token Prediction in Language Models (Aynetdinov & Akbik, ACL 2025)
PDF: https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.1243.pdf