Caroline Cheng
2026
When Do LLMs Need Human Experts? Evidence for Social Science from Jurisprudential Classification
Caroline Cheng | Edward Stiglitz | David Mimno | Matthew Wilkens
Proceedings of the Seventh Workshop on Natural Language Processing and Computational Social Science
Social scientists increasingly use large language models (LLMs) to classify text at scale, raising a key question: when can LLMs replace expert human annotation? Prior work found that earlier generative models failed on complex social science tasks while fine-tuned BERT succeeded, but whether current frontier-scale models close this gap remained untested. We investigate this question on a challenging legal reasoning task—classifying paragraphs from U.S. Supreme Court opinions as employing formal, grand, or no reasoning. Testing frontier LLMs including GPT-5.2 and leading open-weight alternatives, we find that even the most capable prompted models consistently underperform fine-tuned BERT. Only when high-parameter-count generative LLMs are fine-tuned on human-annotated training data does performance improve, and fine-tuned BERT remains a cost-effective alternative. Contrary to a common view, our results demonstrate that scaling to frontier-size LLMs does not eliminate the need for expert annotation on tasks requiring deep domain expertise—a finding with important implications for computational social science measurement.