Capturing Epistemic Uncertainty in LLM-Based Soft Labeling

Yanru Jiang, Siyu Liang


Abstract
In many human-annotated NLP tasks involving ambiguity or subjective judgment, annotator disagreement reflects epistemic uncertainty rather than noise. Soft labeling (SL), which represents annotations as probability distributions rather than majority-vote (MV) labels, preserves this uncertainty and can improve downstream performance. We extend this perspective to LLM-based annotation by formalizing LLM soft labeling as introducing controlled variation in model-generated annotations to approximate the latent variability underlying human annotations. We distinguish two sources of variation: model-induced (e.g., stochastic decoding and model ensembles) and human-approximated (e.g., persona prompting and human-calibrated in-context annotation). Using the Gab Hate and GoEmotions datasets, we show that SL training consistently outperforms MV training under stronger LLM-based annotation strategies. Model ensembles produce the most informative soft-label distributions, achieving the best human–LLM agreement and downstream classification performance. These findings suggest that scalable LLM-based annotation pipelines can model epistemic uncertainty through diverse model-level variation without explicitly simulating human attributes.
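To make the SL versus MV distinction concrete, here is a minimal sketch, not the authors' implementation, of how a set of annotations for one item could be turned into either a soft label or a majority-vote label. It also shows a cross-entropy loss against a soft target. The function names, the class names, and the example annotations are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def soft_label(annotations, classes):
    """Return the empirical distribution of annotations over `classes`."""
    if not annotations:
        raise ValueError("need at least one annotation")
    counts = Counter(annotations)
    total = len(annotations)
    return [counts[c] / total for c in classes]


def majority_vote(annotations, classes):
    """Return the most frequent class; ties go to the earlier entry in `classes`."""
    if not annotations:
        raise ValueError("need at least one annotation")
    counts = Counter(annotations)
    return max(classes, key=lambda c: counts[c])


def soft_cross_entropy(pred_probs, target_dist, eps=1e-12):
    """Cross-entropy between a predicted distribution and a soft target."""
    return -sum(t * math.log(p + eps) for p, t in zip(pred_probs, target_dist))


# Hypothetical example: five annotations (human or LLM-generated) for one item.
classes = ["hate", "not_hate"]
annotations = ["hate", "not_hate", "hate", "hate", "not_hate"]

sl = soft_label(annotations, classes)      # keeps the 3-to-2 split
mv = majority_vote(annotations, classes)   # discards the disagreement
loss = soft_cross_entropy([0.5, 0.5], sl)  # loss of a uniform prediction
```

In this sketch the soft label retains the disagreement among annotators, while the majority-vote label collapses it to a single class. The annotations could come from any of the variation sources named in the abstract, for example one annotation per model in an ensemble.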
Anthology ID:
2026.gem-main.21
Volume:
Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
Month:
July
Year:
2026
Address:
San Diego, California, USA
Editors:
Simon Mille, Sebastian Gehrmann, Patrícia Schmidtová, Ondřej Dušek, Marzieh Fadaee, Kyle Lo, Enrico Santus, Gabriel Stanovsky
Venues:
GEM | WS
Publisher:
Association for Computational Linguistics
Pages:
177–190
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.21/
Cite (ACL):
Yanru Jiang and Siyu Liang. 2026. Capturing Epistemic Uncertainty in LLM-Based Soft Labeling. In Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM), pages 177–190, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal):
Capturing Epistemic Uncertainty in LLM-Based Soft Labeling (Jiang & Liang, GEM 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.21.pdf