Is C4 Dataset Optimal for Pruning? An Investigation of Calibration Data for LLM Pruning

Abhinav Bandari, Lu Yin, Cheng-Yu Hsieh, Ajay Kumar Jaiswal, Tianlong Chen, Li Shen, Ranjay Krishna, Shiwei Liu


Abstract
Network pruning has emerged as a potential solution for making LLMs cheaper to deploy. However, existing LLM pruning approaches universally rely on the C4 dataset as the calibration data for calculating pruning scores, leaving its optimality unexplored. In this study, we evaluate the choice of calibration data for LLM pruning across a wide range of datasets that are most commonly used in LLM training and evaluation, including four pre-training datasets as well as three categories of downstream tasks encompassing nine datasets. Each downstream dataset is prompted with In-Context Learning (ICL) and Chain-of-Thought (CoT), respectively. Beyond the already intriguing observation that the choice of calibration data significantly impacts the performance of pruned LLMs, our results uncover several subtle and often unexpected findings, summarized as follows: (1) C4 is not the optimal choice for LLM pruning, even among commonly used pre-training datasets; (2) arithmetic datasets, when used as calibration data, perform on par with or even better than pre-training datasets; (3) pruning with downstream datasets does not necessarily help the corresponding downstream task, compared to pre-training data; (4) ICL is widely beneficial to all data categories, whereas CoT is only useful on certain tasks. Our findings shed light on the importance of carefully selecting calibration data for LLM pruning and pave the way for more efficient deployment of these powerful models in real-world applications. We release our code at: https://github.com/abx393/llm-pruning-calibration-data.
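The abstract's central mechanism, computing pruning scores from a small calibration set, can be made concrete with a short sketch. The PyTorch snippet below is our illustration, not the authors' released code (see the repository linked above for that): it implements the Wanda-style criterion |W_ij| * ||X_j||_2, one common calibration-dependent pruning score, where X_j is the j-th input feature collected over calibration tokens. The function names and the 128-sequence calibration size are hypothetical.

```python
import torch

def wanda_style_scores(weight: torch.Tensor, calib_acts: torch.Tensor) -> torch.Tensor:
    """Calibration-dependent pruning scores: |W_ij| * ||X_j||_2.

    weight:     (out_features, in_features) weight of one linear layer
    calib_acts: (num_tokens, in_features) layer inputs gathered by running
                the calibration set (e.g., 128 sequences from C4) through
                the frozen model
    """
    feat_norms = calib_acts.norm(p=2, dim=0)        # (in_features,)
    return weight.abs() * feat_norms.unsqueeze(0)   # broadcast over output rows

def prune_unstructured(weight: torch.Tensor, scores: torch.Tensor,
                       sparsity: float = 0.5) -> torch.Tensor:
    """Zero out the lowest-scoring fraction of weights in each output row."""
    k = int(weight.shape[1] * sparsity)
    _, idx = torch.topk(scores, k, dim=1, largest=False)  # k smallest per row
    pruned = torch.zeros_like(weight, dtype=torch.bool)
    pruned.scatter_(1, idx, True)
    return weight.masked_fill(pruned, 0.0)
```

Under this criterion, swapping the calibration corpus changes calib_acts, and hence the feature norms and the set of surviving weights; that dependence is exactly the variable the paper studies across pre-training and downstream datasets.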
Anthology ID: 2024.emnlp-main.1004
Volume: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 18089–18099
URL: https://aclanthology.org/2024.emnlp-main.1004
DOI: 10.18653/v1/2024.emnlp-main.1004
Cite (ACL): Abhinav Bandari, Lu Yin, Cheng-Yu Hsieh, Ajay Kumar Jaiswal, Tianlong Chen, Li Shen, Ranjay Krishna, and Shiwei Liu. 2024. Is C4 Dataset Optimal for Pruning? An Investigation of Calibration Data for LLM Pruning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 18089–18099, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): Is C4 Dataset Optimal for Pruning? An Investigation of Calibration Data for LLM Pruning (Bandari et al., EMNLP 2024)
PDF: https://preview.aclanthology.org/dois-2013-emnlp/2024.emnlp-main.1004.pdf