Language Model-Driven Data Pruning Enables Efficient Active Learning

Abdul Hameed Azeemi, Ihsan Ayyub Qazi, Agha Ali Raza


Abstract
Active learning (AL) optimizes data labeling efficiency by selecting the most informative instances for annotation. However, scaling active learning to large datasets remains a critical challenge, as AL acquisition functions incur prohibitive computational costs when evaluating large unlabeled data pools. To bridge this gap, we introduce a novel plug-and-play data pruning strategy, ActivePrune, which leverages language models to prune the unlabeled pool. ActivePrune implements a two-stage pruning process: an initial fast evaluation using perplexity scores from an n-gram language model, followed by a high-quality selection using data-quality metrics computed through a quantized LLM. To enhance the diversity of the unlabeled pool, we propose a novel perplexity reweighting method that systematically brings forward underrepresented instances for selection. Experiments on translation, sentiment analysis, topic classification, and summarization tasks, across diverse datasets and AL strategies, demonstrate that ActivePrune outperforms existing data pruning methods. Finally, we compare the tradeoff between selection quality and efficiency across data pruning methods and show that ActivePrune reduces end-to-end AL time by up to 74% compared to other LLM score-based pruning methods.
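The two-stage process described in the abstract can be sketched as follows. This is a minimal illustration only: the unigram perplexity here stands in for the paper's fast n-gram language model, and the `llm_quality` function (a simple type–token ratio) is a hypothetical placeholder for the quality metric computed through a quantized LLM; the paper's perplexity reweighting formula is likewise not reproduced.

```python
import math
from collections import Counter

# A toy unlabeled pool standing in for a large AL data pool.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "quantum entanglement puzzles physicists",
    "the cat sat on the mat again",
    "rare lexical items appear here",
    "the dog ran in the park",
]

# Laplace-smoothed unigram LM over the pool: a cheap stand-in for
# the n-gram model used in stage 1.
counts = Counter(tok for s in corpus for tok in s.split())
total = sum(counts.values())
V = len(counts)

def perplexity(sentence: str) -> float:
    toks = sentence.split()
    logp = sum(math.log((counts[t] + 1) / (total + V)) for t in toks)
    return math.exp(-logp / max(len(toks), 1))

def llm_quality(sentence: str) -> float:
    # Hypothetical placeholder for the quantized-LLM quality score
    # in stage 2 (here: fraction of distinct tokens, for illustration).
    toks = sentence.split()
    return len(set(toks)) / len(toks)

def active_prune(pool, stage1_keep=4, final_keep=2):
    # Stage 1: fast perplexity filter over the whole pool. Keeping
    # high-perplexity items is a rough proxy for favoring
    # underrepresented instances.
    survivors = sorted(pool, key=perplexity, reverse=True)[:stage1_keep]
    # Stage 2: the expensive quality scoring runs only on the small
    # surviving subset, which is the source of the efficiency gain.
    return sorted(survivors, key=llm_quality, reverse=True)[:final_keep]

pruned = active_prune(corpus)
```

The design point is that the costly scorer never touches the full pool: stage 1 shrinks the candidate set cheaply, so the per-instance LLM cost in stage 2 is paid only `stage1_keep` times.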
Anthology ID:
2026.findings-eacl.229
Volume:
Findings of the Association for Computational Linguistics: EACL 2026
Month:
March
Year:
2026
Address:
Rabat, Morocco
Editors:
Vera Demberg, Kentaro Inui, Lluís Màrquez
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
4373–4392
URL:
https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.229/
Cite (ACL):
Abdul Hameed Azeemi, Ihsan Ayyub Qazi, and Agha Ali Raza. 2026. Language Model-Driven Data Pruning Enables Efficient Active Learning. In Findings of the Association for Computational Linguistics: EACL 2026, pages 4373–4392, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
Language Model-Driven Data Pruning Enables Efficient Active Learning (Azeemi et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.229.pdf
Checklist:
 2026.findings-eacl.229.checklist.pdf