Comparing Specialised Small and General Large Language Models on Text Classification: 100 Labelled Samples to Achieve Break-Even Performance

Branislav Pecher, Ivan Srba, Maria Bielikova


Abstract
When solving NLP tasks with limited labelled data, researchers typically either use a general large language model without further updates, or use a small number of labelled samples to tune a specialised smaller model. In this work, we answer an important question – how many labelled samples are required for the specialised small models to outperform general large models, while taking the performance variance into consideration. By observing the behaviour of fine-tuning, instruction-tuning, prompting and in-context learning on 8 language models, we identify such performance break-even points across 8 representative text classification tasks of varying characteristics. We show that the specialised models often need only a few samples (on average 100) to be on par with or better than the general ones. At the same time, the number of required labels strongly depends on the dataset or task characteristics, with fine-tuning on binary datasets requiring significantly more samples. When performance variance is taken into consideration, the number of required labels increases on average by 100–200%. Finally, larger models do not consistently lead to better performance and lower variance, with 4-bit quantisation having negligible impact.
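
The break-even analysis the abstract describes can be made concrete with a short sketch. The snippet below is not from the paper: the function name break_even_point, the variance criterion (the small model's mean minus one standard deviation must reach the large model's mean plus one standard deviation), and all numbers are illustrative assumptions.

# Illustrative sketch (not from the paper): locating the break-even
# point between a specialised small model and a general large model.
import numpy as np

def break_even_point(sample_sizes, small_runs, large_runs, consider_variance=False):
    """Smallest number of labelled samples at which the specialised
    small model matches the general large model.

    sample_sizes : ascending 1-D array of labelled-sample counts
    small_runs   : (len(sample_sizes), n_runs) scores of the small
                   model tuned on that many samples, one column per run
    large_runs   : 1-D scores of the general model across random runs
    consider_variance : if True, require the small model's lower bound
                   (mean - std) to reach the large model's upper bound
                   (mean + std), a stricter, variance-aware criterion
    """
    small_mean = small_runs.mean(axis=1)
    large_mean = large_runs.mean()
    if consider_variance:
        small_score = small_mean - small_runs.std(axis=1)
        large_score = large_mean + large_runs.std()
    else:
        small_score, large_score = small_mean, large_mean
    hits = np.flatnonzero(small_score >= large_score)
    return int(sample_sizes[hits[0]]) if hits.size else None

# Hypothetical accuracies: a fine-tuned small model improving with
# more labels vs. a prompted large model with a fixed zero-shot score.
rng = np.random.default_rng(0)
sizes = np.array([10, 25, 50, 100, 250, 500, 1000])
small = 0.55 + 0.1 * np.log10(sizes)[:, None] + rng.normal(0, 0.02, (7, 10))
large = rng.normal(0.78, 0.01, 10)

print(break_even_point(sizes, small, large))                          # mean-only criterion
print(break_even_point(sizes, small, large, consider_variance=True))  # variance-aware

Under this hypothetical criterion, the variance-aware break-even point lands at a larger sample size than the plain comparison of means, mirroring the abstract's observation that accounting for variance raises the label requirement.
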
Anthology ID:
2025.emnlp-main.9
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
165–184
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.9/
Cite (ACL):
Branislav Pecher, Ivan Srba, and Maria Bielikova. 2025. Comparing Specialised Small and General Large Language Models on Text Classification: 100 Labelled Samples to Achieve Break-Even Performance. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 165–184, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Comparing Specialised Small and General Large Language Models on Text Classification: 100 Labelled Samples to Achieve Break-Even Performance (Pecher et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.9.pdf
Checklist:
 2025.emnlp-main.9.checklist.pdf