AutoMixer: Checkpoint Artifacts as Automatic Data Mixers

Ernie Chang, Yang Li, Patrick Huber, Vish Vogeti, David Kant, Yangyang Shi, Vikas Chandra


Abstract
In language model training, it is desirable to equip models with capabilities from various tasks. However, it is not clear how to directly obtain the right data mixtures for these capabilities, as the relationship between data and tasks is difficult to model. In this work, we observe that checkpoint models exhibit emerging capabilities at different points in the training trajectory. Often, the training process saves checkpoints as artifacts that are under-utilized as a source of in-training data signals. We identify these artifact models based on their respective capabilities on the benchmarks and leverage them as data mixers by using their aggregated first-order influence approximation over source data. We demonstrate on eight reasoning benchmarks that the proposed framework yields significant improvements in the pretraining setting, with accuracy gains of up to 1.93%. Overall, this demonstrates the potential of checkpoint models to enhance data quality and optimize data mixtures.
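Note: the abstract's "first-order influence approximation" is commonly read as scoring source data by the inner product between a source example's loss gradient and the gradient of a target (benchmark) loss at a given checkpoint. The sketch below illustrates that reading in PyTorch; the helper names, the loss_fn(model, batch) interface, and the plain sum over checkpoints are illustrative assumptions, not the paper's exact procedure.

import torch

def flat_grad(loss, model):
    # Flatten the gradient of a scalar loss w.r.t. all trainable parameters.
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def influence_scores(model, source_batches, target_batch, loss_fn):
    # First-order influence of each source batch on the target (benchmark)
    # loss: the inner product of their loss gradients at the current model.
    g_target = flat_grad(loss_fn(model, target_batch), model)
    return [torch.dot(flat_grad(loss_fn(model, b), model), g_target).item()
            for b in source_batches]

def aggregate_influence(make_model, ckpt_paths, source_batches, target_batch, loss_fn):
    # Sum influence scores across the selected checkpoint artifacts;
    # selecting checkpoints by benchmark capability is not shown here.
    totals = [0.0] * len(source_batches)
    for path in ckpt_paths:
        model = make_model()
        model.load_state_dict(torch.load(path))  # assumes state-dict checkpoints
        scores = influence_scores(model, source_batches, target_batch, loss_fn)
        for i, s in enumerate(scores):
            totals[i] += s
    # Higher aggregate scores suggest up-weighting that source in the data mix.
    return totals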
Anthology ID:
2025.acl-long.979
Volume:
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
19942–19953
URL:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.979/
Cite (ACL):
Ernie Chang, Yang Li, Patrick Huber, Vish Vogeti, David Kant, Yangyang Shi, and Vikas Chandra. 2025. AutoMixer: Checkpoint Artifacts as Automatic Data Mixers. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 19942–19953, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
AutoMixer: Checkpoint Artifacts as Automatic Data Mixers (Chang et al., ACL 2025)
PDF:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.979.pdf