Label and Explanation Variation in LLM-Based Annotation: a Case Study in Natural Language Inference

Artur Kulmizev, Erika Lombart, Patrick Watrin, Marie-Catherine de Marneffe


Abstract
Large language models (LLMs) have shown considerable promise for annotation purposes, yet questions remain about their ability to capture human label variation (HLV) — genuine disagreement between annotators often observed across NLP tasks. Here, we investigate how label and explanation variation manifests within and across LLMs with respect to the Natural Language Inference (NLI) task. Using zero-shot prompting with exact human annotation instructions, we treat individual model generations as participants and examine three response sampling strategies: varying generation parameters, leveraging within-family model size differences, and pooling responses from distinct LLMs. We show that, while model ensembles can generate label distributions similar to humans, they likewise exhibit distinct, idiosyncratic judgments and disagreement patterns. We further analyze explanation variation, observing that, although models generate longer explanations than humans, they demonstrate substantially less stylistic diversity. Our findings suggest that, while LLMs may serve as useful tools for generating diverse annotations, they should not be viewed as drop-in replacements for human annotators — particularly in applications requiring authentic representation of diversity in human judgments, such as NLI.
Anthology ID:
2026.acl-long.752
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
16526–16543
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.752/
Cite (ACL):
Artur Kulmizev, Erika Lombart, Patrick Watrin, and Marie-Catherine de Marneffe. 2026. Label and Explanation Variation in LLM-Based Annotation: a Case Study in Natural Language Inference. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16526–16543, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Label and Explanation Variation in LLM-Based Annotation: a Case Study in Natural Language Inference (Kulmizev et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.752.pdf
Checklist:
2026.acl-long.752.checklist.pdf