Abstract
Computational social science (CSS) practitioners often rely on human-labeled data to fine-tune supervised text classifiers. We assess the potential for researchers to augment or replace human-generated training data with surrogate training labels from generative large language models (LLMs). We introduce a recommended workflow and test this LLM application by replicating 14 classification tasks and measuring performance. We employ a novel corpus of English-language text classification data sets from recent CSS articles in high-impact journals. Because these data sets are stored in password-protected archives, our analyses are less prone to issues of contamination. For each task, we compare supervised classifiers fine-tuned using GPT-4 labels against classifiers fine-tuned with human annotations and against labels from GPT-4 and Mistral-7B with few-shot in-context learning. Our findings indicate that supervised classification models fine-tuned on LLM-generated labels perform comparably to models fine-tuned with labels from human annotators. Fine-tuning models using LLM-generated labels can be a fast, efficient, and cost-effective method of building supervised text classifiers.
- Anthology ID:
- 2024.nlpcss-1.9
- Volume:
- Proceedings of the Sixth Workshop on Natural Language Processing and Computational Social Science (NLP+CSS 2024)
- Month:
- June
- Year:
- 2024
- Address:
- Mexico City, Mexico
- Editors:
- Dallas Card, Anjalie Field, Dirk Hovy, Katherine Keith
- Venues:
- NLP+CSS | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 113–131
- URL:
- https://aclanthology.org/2024.nlpcss-1.9
- Cite (ACL):
- Nicholas Pangakis and Sam Wolken. 2024. Knowledge Distillation in Automated Annotation: Supervised Text Classification with LLM-Generated Training Labels. In Proceedings of the Sixth Workshop on Natural Language Processing and Computational Social Science (NLP+CSS 2024), pages 113–131, Mexico City, Mexico. Association for Computational Linguistics.
- Cite (Informal):
- Knowledge Distillation in Automated Annotation: Supervised Text Classification with LLM-Generated Training Labels (Pangakis & Wolken, NLP+CSS-WS 2024)
- PDF:
- https://preview.aclanthology.org/ingestion-checklist/2024.nlpcss-1.9.pdf
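The distillation workflow described in the abstract can be sketched minimally. In this illustrative sketch, `llm_label` is a hypothetical stub standing in for a GPT-4 annotation call (a real pipeline would prompt the LLM with the task's codebook and parse its response), and a from-scratch naive Bayes model stands in for the fine-tuned supervised classifier; none of the names or data below come from the paper itself.

```python
# Sketch of knowledge distillation for text classification:
# 1) label an unlabeled corpus with an LLM, 2) fit a supervised
# classifier on those surrogate labels. All names here are illustrative.
from collections import Counter, defaultdict
import math


def llm_label(text):
    # Hypothetical stand-in for a GPT-4 annotation call.
    return "positive" if "good" in text or "great" in text else "negative"


def tokenize(text):
    return text.lower().split()


class NaiveBayes:
    """Minimal multinomial naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.label_counts = Counter(labels)
        self.vocab = set()
        for text, label in zip(texts, labels):
            for tok in tokenize(text):
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, text):
        best, best_lp = None, float("-inf")
        total = sum(self.label_counts.values())
        for label in self.label_counts:
            lp = math.log(self.label_counts[label] / total)  # log prior
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokenize(text):
                # Laplace-smoothed log likelihood of each token
                lp += math.log((self.word_counts[label][tok] + 1) / denom)
            if lp > best_lp:
                best, best_lp = label, lp
        return best


# Step 1: the LLM annotates an unlabeled corpus (surrogate labels).
unlabeled = [
    "the service was good and friendly",
    "a great film with great acting",
    "terrible plot and awful pacing",
    "the food was bad and cold",
]
surrogate_labels = [llm_label(t) for t in unlabeled]

# Step 2: fine-tune (here: fit) a supervised classifier on those labels.
clf = NaiveBayes().fit(unlabeled, surrogate_labels)
print(clf.predict("good acting"))  # -> positive
```

The key design point is that the downstream classifier never sees human annotations: its training signal comes entirely from the LLM, which is what makes the approach fast and cheap relative to human labeling.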