LimaCost: Data Valuation for Instruction Tuning of Large Language Models

Hyeonseok Moon, Jaehyung Seo, Seonmin Koo, Jinsung Kim, Young-kyoung Ham, Jiwon Moon, Heuiseok Lim


Abstract
Instruction tuning (IT) is an effective approach for aligning large language models (LLMs) with human intentions. There is ongoing discourse regarding data quality for IT. In an effort to establish robust criteria for IT data quality, we introduce LimaCost, a data quality measure that exhibits a strong correlation with model performance. LimaCost utilizes the LIMA dataset, whose effectiveness in IT has already been validated by several previous works. LimaCost then estimates the value of a given data point by estimating how many LIMA data points would be needed to approximate its gradient. Our experiments reveal that LimaCost enables effective data selection that derives high alignment performance. We demonstrate that selecting data with high LimaCost proves more effective than existing data selection strategies.
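The abstract only sketches the method at a high level, so the following is a minimal illustrative reading, not the paper's actual algorithm: the score of a candidate example is the number of LIMA gradients a greedy matching-pursuit procedure needs before the candidate's gradient is approximated within a tolerance. The function name `lima_cost`, the parameters `tol` and `max_atoms`, the pursuit procedure, and the representation of gradients as flat vectors are all assumptions made for this sketch.

```python
import numpy as np

def lima_cost(g: np.ndarray, lima_grads: np.ndarray,
              tol: float = 0.1, max_atoms: int = 50) -> int:
    """Hypothetical sketch of a LimaCost-style score: count how many
    LIMA gradients greedy matching pursuit needs to approximate the
    candidate gradient `g`. `lima_grads` has shape (n_lima, dim)."""
    # Unit-normalize the LIMA gradients so selection is driven by
    # alignment with the residual rather than by gradient magnitude.
    norms = np.linalg.norm(lima_grads, axis=1, keepdims=True) + 1e-12
    atoms = lima_grads / norms

    residual = g.astype(np.float64).copy()
    g_norm = np.linalg.norm(g) + 1e-12
    count = 0
    while np.linalg.norm(residual) / g_norm > tol and count < max_atoms:
        # Pick the LIMA gradient most aligned with the current residual.
        scores = atoms @ residual
        best = int(np.argmax(np.abs(scores)))
        # Remove that gradient's projection from the residual.
        residual -= scores[best] * atoms[best]
        count += 1
    return count  # more atoms needed -> harder to cover with LIMA data
```

Under this reading, a high count means the example carries gradient information that many LIMA points together can only roughly reconstruct, which matches the abstract's finding that selecting high-LimaCost data improves alignment performance.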
Anthology ID:
2025.findings-emnlp.688
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
12841–12854
URL:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.688/
DOI:
10.18653/v1/2025.findings-emnlp.688
Cite (ACL):
Hyeonseok Moon, Jaehyung Seo, Seonmin Koo, Jinsung Kim, Young-kyoung Ham, Jiwon Moon, and Heuiseok Lim. 2025. LimaCost: Data Valuation for Instruction Tuning of Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 12841–12854, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
LimaCost: Data Valuation for Instruction Tuning of Large Language Models (Moon et al., Findings 2025)
PDF:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.688.pdf
Checklist:
2025.findings-emnlp.688.checklist.pdf