Abstract
Although Shapley values have been shown to be highly effective for identifying harmful training instances, dataset size and model complexity constraints limit the ability to apply Shapley-based data valuation to fine-tuning large pre-trained language models. To address this, we propose TS-DShapley, an algorithm that reduces the computational cost of Shapley-based data valuation through: 1) an efficient sampling-based method that aggregates Shapley values computed from subsets to value the entire training set, and 2) a value transfer method that leverages value information extracted from a simple classifier trained on representations from the target language model. Our experiments applying TS-DShapley to select data for fine-tuning BERT-based language models on benchmark natural language understanding (NLU) datasets show that TS-DShapley outperforms existing data selection methods. Further, TS-DShapley can filter fine-tuning data to improve language model performance compared to training with the full fine-tuning dataset.

- Anthology ID: 2023.acl-srw.37
- Volume: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)
- Month: July
- Year: 2023
- Address: Toronto, Canada
- Editors: Vishakh Padmakumar, Gisela Vallejo, Yao Fu
- Venue: ACL
- Publisher: Association for Computational Linguistics
- Pages: 266–275
- URL: https://aclanthology.org/2023.acl-srw.37
- DOI: 10.18653/v1/2023.acl-srw.37
- Cite (ACL): Stephanie Schoch, Ritwick Mishra, and Yangfeng Ji. 2023. Data Selection for Fine-tuning Large Language Models Using Transferred Shapley Values. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop), pages 266–275, Toronto, Canada. Association for Computational Linguistics.
- Cite (Informal): Data Selection for Fine-tuning Large Language Models Using Transferred Shapley Values (Schoch et al., ACL 2023)
- PDF: https://preview.aclanthology.org/nschneid-patch-2/2023.acl-srw.37.pdf
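For readers unfamiliar with Shapley-based data valuation, the sampling idea the abstract refers to can be illustrated with a minimal permutation-sampling sketch: each training point's value is its marginal contribution to a utility function (e.g., validation accuracy), averaged over random orderings of the data. This is a generic Monte Carlo Shapley estimator, not the paper's TS-DShapley implementation; the `utility` function, toy data, and sample count below are illustrative assumptions.

```python
import random

def sampled_shapley_values(points, utility, num_permutations=200, seed=0):
    """Estimate per-point Shapley data values by Monte Carlo sampling:
    for each random permutation, add points one at a time and credit each
    point with the change in utility it causes; average over permutations."""
    rng = random.Random(seed)
    n = len(points)
    values = [0.0] * n
    for _ in range(num_permutations):
        order = list(range(n))
        rng.shuffle(order)
        subset = []
        prev_utility = utility(subset)  # utility of the empty subset
        for idx in order:
            subset.append(points[idx])
            cur_utility = utility(subset)
            values[idx] += cur_utility - prev_utility
            prev_utility = cur_utility
    return [v / num_permutations for v in values]

# Toy utility: a subset's value is the sum of its (0/1) labels, so
# points labeled 1 should receive Shapley value 1 and the rest 0.
data = [0, 1, 0, 1, 1]
vals = sampled_shapley_values(data, utility=sum, num_permutations=500)
```

In practice the utility would be the accuracy of a model retrained (or, as in the paper, a simple classifier trained on frozen language-model representations) on the current subset, and low-valued points would be filtered before fine-tuning.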