@inproceedings{shen-etal-2025-condenselm,
title = "{C}ondense{LM}: {LLM}s-driven Text Dataset Condensation via Reward Matching",
author = "Shen, Cheng and
Ong, Yew-Soon and
Zhou, Joey Tianyi",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.65/",
pages = "1237--1252",
ISBN = "979-8-89176-332-6",
abstract = "Dataset condensation has emerged as a promising technique to improve data efficiency under limited data budgets. However, when applied to the text level, existing methods struggle to compress more information into samples through optimization. Thus, these methods provide no obvious advantage over simpler coreset selection despite their high computational cost. In this paper, we introduce CondenseLM, a novel paradigm for both effective and efficient text-level dataset condensation. Our framework employs an LLMs-driven approach to sidestep the inherent limitations of existing methods, successfully generating more informative and less biased samples. In addition, it incorporates reward matching to align the LLMs-condensed dataset with the original dataset, maximizing representability and coverage. We conducted extensive experiments on SST-2, MNLI, AG News, and IMDB. Our approach outperforms both coreset selection and existing dataset condensation methods by large margins while also substantially reducing the computational cost."
}

Markdown (Informal)
[CondenseLM: LLMs-driven Text Dataset Condensation via Reward Matching](https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.65/) (Shen et al., EMNLP 2025)