Upsample or Upweight? Balanced Training on Heavily Imbalanced Datasets
Tianjian Li, Haoran Xu, Weiting Tan, Kenton Murray, Daniel Khashabi
Abstract
Data abundance across different domains exhibits a long-tailed distribution: a few domains have abundant data, while most face data scarcity. Our work focuses on a multilingual setting, where available data is heavily skewed toward high-resource languages, creating significant imbalances in training data sizes across languages. This disparity makes it challenging to train language models that perform uniformly well across all languages. Two common strategies to address this issue are upsampling low-resource languages (Temperature Sampling) and upweighting their loss functions (Scalarization). These methods are often assumed to be equivalent, but this equivalence has not been rigorously established, prompting our investigation. Through theoretical and empirical analysis, we identify when these two methods are equivalent and when they diverge. We prove that they are equivalent under full gradient descent but differ under stochastic gradient descent due to differences in gradient variance. Specifically, Temperature Sampling exhibits lower variance in gradient estimation, leading to faster convergence but a higher risk of overfitting. Based on these insights, we propose Cooldown, a strategy that starts by heavily upsampling low-resource languages to accelerate convergence and gradually reduces the upsampling to prevent overfitting, achieving the best of both worlds. Our method competes effectively with existing data re-weighting techniques while offering computational efficiency.
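As a rough illustration of the sampling side of this tradeoff, the sketch below computes a temperature-scaled sampling distribution over languages and anneals the temperature over training, loosely mirroring the Cooldown idea described in the abstract; the linear schedule, function names, and example corpus sizes are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def temperature_sampling_probs(sizes, tau):
    """Temperature-scaled sampling distribution over languages.

    sizes: raw number of examples per language.
    tau:   temperature; tau=1 recovers proportional sampling, while
           larger tau moves toward uniform (upsampling low-resource languages).
    """
    q = np.asarray(sizes, dtype=float)
    q = q / q.sum()                   # natural data proportions
    p = q ** (1.0 / tau)              # temperature scaling
    return p / p.sum()

def cooldown_tau(step, total_steps, tau_start=5.0, tau_end=1.0):
    """Illustrative linear schedule (an assumption, not the paper's exact one):
    start with heavy upsampling (high tau) and anneal toward proportional
    sampling to limit overfitting on repeated low-resource data.
    """
    frac = min(step / total_steps, 1.0)
    return tau_start + frac * (tau_end - tau_start)

# Example: three languages with heavily skewed corpus sizes (hypothetical numbers).
sizes = [1_000_000, 50_000, 5_000]
for step in (0, 5_000, 10_000):
    tau = cooldown_tau(step, total_steps=10_000)
    print(step, round(tau, 2), temperature_sampling_probs(sizes, tau).round(3))
```

In this sketch, a high initial temperature flattens the distribution (heavy upsampling of low-resource languages), and annealing toward tau = 1 returns sampling to the natural data proportions, which is the "start aggressive, then cool down" behavior the abstract describes.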
- Anthology ID:
- 2025.naacl-long.171
- Volume:
- Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
- Month:
- April
- Year:
- 2025
- Address:
- Albuquerque, New Mexico
- Editors:
- Luis Chiruzzo, Alan Ritter, Lu Wang
- Venue:
- NAACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 3325–3343
- URL:
- https://preview.aclanthology.org/fix-sig-urls/2025.naacl-long.171/
- Cite (ACL):
- Tianjian Li, Haoran Xu, Weiting Tan, Kenton Murray, and Daniel Khashabi. 2025. Upsample or Upweight? Balanced Training on Heavily Imbalanced Datasets. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3325–3343, Albuquerque, New Mexico. Association for Computational Linguistics.
- Cite (Informal):
- Upsample or Upweight? Balanced Training on Heavily Imbalanced Datasets (Li et al., NAACL 2025)
- PDF:
- https://preview.aclanthology.org/fix-sig-urls/2025.naacl-long.171.pdf