Upsample or Upweight? Balanced Training on Heavily Imbalanced Datasets

Tianjian Li, Haoran Xu, Weiting Tan, Kenton Murray, Daniel Khashabi


Abstract
Data abundance across different domains exhibits a long-tailed distribution: a few domains have abundant data, while most face data scarcity. Our work focuses on a multilingual setting, where available data is heavily skewed toward high-resource languages, creating significant imbalances in training data sizes across languages. This disparity makes it challenging to train language models that perform uniformly well across all languages. Two common strategies to address this issue are upsampling low-resource languages (Temperature Sampling) and upweighting their loss functions (Scalarization). These methods are often assumed to be equivalent, but this equivalence has not been rigorously established, prompting our investigation. Through theoretical and empirical analysis, we identify when these two methods are equivalent and when they diverge. We prove that they are equivalent under full gradient descent but differ under stochastic gradient descent due to differences in gradient variance. Specifically, Temperature Sampling exhibits lower variance in gradient estimation, leading to faster convergence but a higher risk of overfitting. Based on these insights, we propose Cooldown, a strategy that starts by heavily upsampling low-resource languages to accelerate convergence and gradually reduces the upsampling to prevent overfitting, achieving the best of both worlds. Our method competes effectively with existing data re-weighting techniques while offering computational efficiency.
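The sketch below is a minimal illustration (not the authors' code) of the two strategies the abstract contrasts: Temperature Sampling flattens the sampling proportions, Scalarization upweights per-language losses so the expected gradient matches that flattened mix, and a cooldown-style schedule anneals the upsampling over training. The corpus sizes, temperature values, and step counts are illustrative assumptions.

```python
import numpy as np

# Toy corpus sizes: one high-resource and one low-resource "language" (assumed values).
sizes = np.array([900_000, 100_000], dtype=float)
p = sizes / sizes.sum()                      # natural data proportions

def temperature_sampling_probs(p, tau):
    """Upsample low-resource data by flattening proportions with temperature tau."""
    q = p ** (1.0 / tau)
    return q / q.sum()

def scalarization_weights(p, tau):
    """Per-language loss weights q_i / p_i: drawing batches from the natural mix p
    and scaling losses by these weights gives the same *expected* gradient as
    sampling directly from the tempered distribution q."""
    q = temperature_sampling_probs(p, tau)
    return q / p

tau = 3.0
print("sampling probs:", temperature_sampling_probs(p, tau))
print("loss weights  :", scalarization_weights(p, tau))

# Cooldown-style schedule (illustrative): start with strong upsampling (high tau)
# for fast convergence, then anneal toward the natural distribution (tau -> 1)
# to reduce overfitting on the small low-resource corpus.
def cooldown_tau(step, total_steps, tau_start=5.0, tau_end=1.0):
    frac = min(step / total_steps, 1.0)
    return tau_start + frac * (tau_end - tau_start)

for step in (0, 5_000, 10_000):
    t = cooldown_tau(step, total_steps=10_000)
    print(f"step {step:>6}: tau={t:.2f}, probs={temperature_sampling_probs(p, t)}")
```

In expectation the two strategies coincide (sum_i q_i * g_i either way), but their stochastic gradients differ in variance, which is the distinction the paper analyzes.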
Anthology ID:
2025.naacl-long.171
Volume:
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month:
April
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Luis Chiruzzo, Alan Ritter, Lu Wang
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
3325–3343
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.naacl-long.171/
Cite (ACL):
Tianjian Li, Haoran Xu, Weiting Tan, Kenton Murray, and Daniel Khashabi. 2025. Upsample or Upweight? Balanced Training on Heavily Imbalanced Datasets. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3325–3343, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
Upsample or Upweight? Balanced Training on Heavily Imbalanced Datasets (Li et al., NAACL 2025)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.naacl-long.171.pdf