When Do LLMs Need Human Experts? Evidence for Social Science from Jurisprudential Classification

Caroline Cheng, Edward Stiglitz, David Mimno, Matthew Wilkens


Abstract
Social scientists increasingly use large language models (LLMs) to classify text at scale, raising a key question: when can LLMs replace expert human annotation? Prior work found that earlier generative models failed on complex social science tasks while fine-tuned BERT succeeded, but whether current frontier-scale models close this gap remained untested. We investigate this question on a challenging legal reasoning task—classifying paragraphs from U.S. Supreme Court opinions as employing formal, grand, or no reasoning. Testing frontier LLMs including GPT-5.2 and leading open-weight alternatives, we find that even the most capable prompted models consistently underperform fine-tuned BERT. Only when high-parameter-count generative LLMs are fine-tuned on human-annotated training data does performance improve, and fine-tuned BERT remains a cost-effective alternative. Contrary to a common view, our results demonstrate that scaling to frontier-size LLMs does not eliminate the need for expert annotation on tasks requiring deep domain expertise—a finding with important implications for computational social science measurement.
Anthology ID:
2026.nlpcss-1.6
Volume:
Proceedings of the Seventh Workshop on Natural Language Processing and Computational Social Science
Month:
July
Year:
2026
Address:
San Diego
Editors:
Dallas Card, Anjalie Field, Katherine Keith, Julia Mendelsohn
Venues:
NLP+CSS | WS
Publisher:
Association for Computational Linguistics
Pages:
103–112
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.nlpcss-1.6/
Cite (ACL):
Caroline Cheng, Edward Stiglitz, David Mimno, and Matthew Wilkens. 2026. When Do LLMs Need Human Experts? Evidence for Social Science from Jurisprudential Classification. In Proceedings of the Seventh Workshop on Natural Language Processing and Computational Social Science, pages 103–112, San Diego. Association for Computational Linguistics.
Cite (Informal):
When Do LLMs Need Human Experts? Evidence for Social Science from Jurisprudential Classification (Cheng et al., NLP+CSS 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.nlpcss-1.6.pdf