Capturing Epistemic Uncertainty in LLM-Based Soft Labeling

Yanru Jiang; Siyu Liang

Abstract

In many human-annotated NLP tasks involving ambiguity or subjective judgment, annotator disagreement reflects epistemic uncertainty rather than noise. Soft labeling (SL), which represents annotations as probability distributions rather than majority-vote (MV) labels, preserves this uncertainty and can improve downstream performance. We extend this perspective to LLM-based annotation by formalizing LLM soft labeling as introducing controlled variation in model-generated annotations to approximate the latent variability underlying human annotations. We distinguish two sources of variation: model-induced (e.g., stochastic decoding and model ensembles) and human-approximated (e.g., persona prompting and human-calibrated in-context annotation). Using the Gab Hate and GoEmotions datasets, we show that SL training consistently outperforms MV training under stronger LLM-based annotation strategies. Model ensembles produce the most informative soft-label distributions, achieving the best human–LLM agreement and downstream classification performance. These findings suggest that scalable LLM-based annotation pipelines can model epistemic uncertainty through diverse model-level variation without explicitly simulating human attributes.
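As a minimal illustration of the distinction the abstract draws between soft labels and majority-vote labels, the sketch below turns a set of annotations for one item into both representations. The function names, class names, and votes are hypothetical and chosen only for illustration; they are not taken from the paper, and this is not the authors' implementation. The votes could equally come from human annotators or from several LLM runs (for example, the members of a model ensemble).

```python
from collections import Counter

def soft_label(annotations, classes):
    """Return the empirical distribution over classes for one item."""
    if not annotations:
        raise ValueError("at least one annotation is required")
    counts = Counter(annotations)
    total = len(annotations)
    return [counts[c] / total for c in classes]

def majority_vote(annotations, classes):
    """Return the most frequent class; ties go to the class listed first."""
    if not annotations:
        raise ValueError("at least one annotation is required")
    counts = Counter(annotations)
    return max(classes, key=lambda c: counts[c])

# Hypothetical example: four annotations for a single text.
classes = ["not_hate", "hate"]
votes = ["hate", "not_hate", "hate", "hate"]

print(soft_label(votes, classes))     # [0.25, 0.75]
print(majority_vote(votes, classes))  # hate
```

The majority-vote label discards the one dissenting annotation, while the soft label keeps it as probability mass, which is the disagreement signal that soft-label training is meant to preserve.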

Anthology ID: 2026.gem-main.21
Volume: Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
Month: July
Year: 2026
Address: San Diego, California, USA
Editors: Simon Mille, Sebastian Gehrmann, Patrícia Schmidtová, Ondřej Dušek, Marzieh Fadaee, Kyle Lo, Enrico Santus, Gabriel Stanovsky
Venues: GEM | WS
Publisher: Association for Computational Linguistics
Pages: 177–190
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.21/
Cite (ACL): Yanru Jiang and Siyu Liang. 2026. Capturing Epistemic Uncertainty in LLM-Based Soft Labeling. In Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM), pages 177–190, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal): Capturing Epistemic Uncertainty in LLM-Based Soft Labeling (Jiang & Liang, GEM 2026)
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.21.pdf
