Measuring What Matters: Evaluating Ensemble LLMs with Label Refinement in Inductive Coding

Angelina Parfenova, Jürgen Pfeffer


Abstract
Inductive coding traditionally relies on labor-intensive work by human coders, who are prone to inconsistencies and individual biases. Although large language models (LLMs) offer promising automation capabilities, their standalone use often results in inconsistent outputs, limiting their reliability. In this work, we propose a framework that combines ensemble methods with a code refinement methodology to address these challenges. Our approach integrates multiple smaller LLMs, fine-tuned via Low-Rank Adaptation (LoRA), and employs a moderator-based mechanism to simulate human consensus. To address the limitations of metrics such as ROUGE and BERTScore, we introduce a composite evaluation metric that combines code conciseness and contextual similarity. The validity of this metric is confirmed through correlation analysis with human expert ratings. Results demonstrate that smaller ensemble models with refined outputs consistently outperform other ensembles, individual models, and even large-scale LLMs such as GPT-4. Our evidence suggests that smaller ensemble models significantly outperform larger standalone language models, highlighting the risk of relying solely on a single large model for qualitative analysis.
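The abstract describes a composite metric that blends code conciseness with contextual similarity. The paper's exact formula and weights are not reproduced on this page, so the following is a minimal illustrative sketch only: the embedding model, the mixing weight alpha, and the target code length are all assumptions, not the authors' published configuration.

```python
# Hypothetical sketch of a composite score balancing contextual similarity
# and code conciseness; not the paper's actual metric.
from sentence_transformers import SentenceTransformer, util

_model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding backbone


def composite_score(generated_code: str, reference_code: str,
                    alpha: float = 0.7, target_len: int = 5) -> float:
    """Blend semantic similarity with a brevity term (illustrative only)."""
    # Contextual similarity: cosine similarity between sentence embeddings.
    emb = _model.encode([generated_code, reference_code])
    similarity = float(util.cos_sim(emb[0], emb[1]))

    # Conciseness: penalize generated codes longer than an assumed target length.
    n_words = len(generated_code.split())
    conciseness = min(1.0, target_len / max(n_words, 1))

    # Weighted combination; alpha is a hypothetical mixing weight.
    return alpha * similarity + (1 - alpha) * conciseness


# Example usage with a short qualitative code and its human reference.
print(composite_score("distrust of institutions", "lack of trust in institutions"))
```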
Anthology ID:
2025.findings-acl.563
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
10803–10816
URL:
https://preview.aclanthology.org/landing_page/2025.findings-acl.563/
Cite (ACL):
Angelina Parfenova and Jürgen Pfeffer. 2025. Measuring What Matters: Evaluating Ensemble LLMs with Label Refinement in Inductive Coding. In Findings of the Association for Computational Linguistics: ACL 2025, pages 10803–10816, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Measuring What Matters: Evaluating Ensemble LLMs with Label Refinement in Inductive Coding (Parfenova & Pfeffer, Findings 2025)
PDF:
https://preview.aclanthology.org/landing_page/2025.findings-acl.563.pdf