Threading the Needle: Reweaving Chain-of-Thought Reasoning to Explain Human Label Variation

Beiduo Chen, Yang Janet Liu, Anna Korhonen, Barbara Plank


Abstract
The recent rise of reasoning-tuned Large Language Models (LLMs), which generate chains of thought (CoTs) before giving the final answer, has attracted significant attention and offers new opportunities for gaining insights into human label variation (HLV), which refers to plausible differences in how multiple annotators label the same data instance. Prior work has shown that LLM-generated explanations can help align model predictions with human label distributions, but it typically adopts a *reverse* paradigm: producing explanations based on given answers. In contrast, CoTs provide a *forward* reasoning path that may implicitly embed rationales for each answer option before the answers are generated. We thus propose a novel LLM-based pipeline enriched with linguistically-grounded discourse segmenters to extract supporting and opposing statements for each answer option from CoTs with improved accuracy. We also propose a rank-based HLV evaluation framework that prioritizes the ranking of answers over exact scores, which instead favor direct comparison of label distributions. Our method outperforms a direct generation method as well as baselines on three datasets, and shows better alignment of ranking methods with humans, highlighting the effectiveness of our approach.
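To make the contrast between rank-based and score-based evaluation concrete, the following is a minimal sketch and not the paper's evaluation framework. The abstract does not name specific metrics, so Kendall's tau and total variation distance are assumptions chosen for illustration, and the label names and probabilities are made-up toy values.

```python
from itertools import combinations


def total_variation(p, q):
    """Score-based view: half the L1 distance between two label distributions."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(l, 0.0) - q.get(l, 0.0)) for l in labels)


def kendall_tau(p, q):
    """Rank-based view: Kendall's tau-a over all label pairs (ties count as neither)."""
    labels = sorted(set(p) | set(q))
    concordant = discordant = 0
    for a, b in combinations(labels, 2):
        # Positive product: both distributions order the pair the same way.
        s = (p.get(a, 0.0) - p.get(b, 0.0)) * (q.get(a, 0.0) - q.get(b, 0.0))
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(labels) * (len(labels) - 1) // 2
    return (concordant - discordant) / n_pairs if n_pairs else 0.0


# Toy values (assumed): a human label distribution and two model outputs.
human = {"entailment": 0.6, "neutral": 0.3, "contradiction": 0.1}
model_a = {"entailment": 0.9, "neutral": 0.08, "contradiction": 0.02}  # same order, scores far off
model_b = {"entailment": 0.4, "neutral": 0.45, "contradiction": 0.15}  # scores closer, top two swapped

for name, model in [("model_a", model_a), ("model_b", model_b)]:
    print(name, round(total_variation(human, model), 3), round(kendall_tau(human, model), 3))
```

In this toy case model_b is closer to the human distribution by distance, while model_a preserves the human ordering of the answers exactly, so the two views of alignment can disagree.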
Anthology ID:
2025.emnlp-main.1682
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
33099–33123
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1682/
Cite (ACL):
Beiduo Chen, Yang Janet Liu, Anna Korhonen, and Barbara Plank. 2025. Threading the Needle: Reweaving Chain-of-Thought Reasoning to Explain Human Label Variation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 33099–33123, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Threading the Needle: Reweaving Chain-of-Thought Reasoning to Explain Human Label Variation (Chen et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1682.pdf
Checklist:
 2025.emnlp-main.1682.checklist.pdf