Michael Horn

2026

Qualitative analysis is critical to understanding human datasets in many social science disciplines. A central method in this process is inductive coding, where researchers identify and interpret codes directly from the datasets themselves. Yet, this exploratory approach poses challenges for meeting methodological expectations (such as "depth" and "variation"), especially as researchers increasingly adopt Generative AI (GAI) for support. Ground-truth-based metrics are insufficient because they contradict the exploratory nature of inductive coding; cluster- or topic-level metrics fail to capture the interpretive, cross-cutting nature of qualitative codes; and manual evaluation can be labor-intensive. This paper presents a theory-informed computational method for measuring inductive coding results from humans and GAI. Our method first merges individual codebooks into an Aggregated Code Space using an LLM-enriched hierarchical clustering algorithm. It then measures each coder’s contribution against the merged result using four novel metrics: Coverage, Overlap, Novelty, and Divergence, designed to capture breadth, consensus, unique contribution, and systematic deviation without assuming ground truth. Through two experiments on a human-coded online conversation dataset, we 1) reveal the merging algorithm’s impact on metrics; 2) validate the metrics’ stability and robustness across multiple runs and different LLMs; and 3) showcase the metrics’ ability to diagnose coding issues, such as excessive or irrelevant (hallucinated) codes. We discuss how these metrics should be interpreted in combination and their current limitations. Our work provides a reliable pathway for ensuring methodological rigor in human-AI qualitative analysis.

Co-authors

Uri Wilensky 1

Yanjia Zhang 1

Lexie Zhao 1

Venues

Findings 1
