2025
Text Annotation via Inductive Coding: Comparing Human Experts to LLMs in Qualitative Data Analysis
Angelina Parfenova | Andreas Marfurt | Jürgen Pfeffer | Alexander Denzler
Findings of the Association for Computational Linguistics: NAACL 2025
This paper investigates the automation of qualitative data analysis, focusing on inductive coding with large language models (LLMs). Unlike traditional approaches that rely on deductive methods with predefined labels, this research examines the inductive process, in which labels emerge from the data. The study evaluates the performance of six open-source LLMs against human experts. As part of the evaluation, experts rated the perceived difficulty of the quotes they coded. The results reveal a striking dichotomy: human coders consistently perform well when labeling complex sentences but struggle with simpler ones, while LLMs exhibit the opposite trend. Additionally, the study explores systematic deviations in both human- and LLM-generated labels by comparing them to the gold standard from the test set. While human annotations sometimes differ from the gold standard, they are often rated more favorably by other humans. In contrast, some LLMs align more closely with the true labels but receive lower evaluations from experts.
Measuring What Matters: Evaluating Ensemble LLMs with Label Refinement in Inductive Coding
Angelina Parfenova | Jürgen Pfeffer
Findings of the Association for Computational Linguistics: ACL 2025
Inductive coding traditionally relies on labor-intensive effort by human coders, who are prone to inconsistencies and individual biases. Although large language models (LLMs) offer promising automation capabilities, their standalone use often yields inconsistent outputs, limiting their reliability. In this work, we propose a framework that combines ensemble methods with a code-refinement methodology to address these challenges. Our approach integrates multiple smaller LLMs, fine-tuned via Low-Rank Adaptation (LoRA), and employs a moderator-based mechanism to simulate human consensus. To address the limitations of metrics like ROUGE and BERTScore, we introduce a composite evaluation metric that combines code conciseness and contextual similarity. The validity of this metric is confirmed through correlation analysis with human expert ratings. Results demonstrate that smaller ensemble models with refined outputs consistently outperform other ensembles, individual models, and even large-scale LLMs such as GPT-4. Our evidence suggests that smaller ensemble models significantly outperform larger standalone language models, highlighting the risk of relying solely on a single large model for qualitative analysis.
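The abstract does not give the exact formula, but a composite metric of this kind can be sketched as a weighted combination of embedding-based contextual similarity and a length-based conciseness term. The weighting scheme, embedding model, and target length below are illustrative assumptions, not the authors' implementation:

```python
# Hypothetical sketch of a composite score for a generated qualitative code
# (i.e., a short thematic label), balancing contextual similarity against
# conciseness. Model choice, alpha, and target_len are assumptions.
from sentence_transformers import SentenceTransformer, util

_model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def composite_score(generated_code: str, reference_code: str,
                    alpha: float = 0.5, target_len: int = 5) -> float:
    """Weighted mix of embedding similarity and a conciseness term."""
    # Contextual similarity: cosine similarity of sentence embeddings.
    emb = _model.encode([generated_code, reference_code], convert_to_tensor=True)
    similarity = util.cos_sim(emb[0], emb[1]).item()

    # Conciseness: penalize codes much longer than a target length (in words).
    length = len(generated_code.split())
    conciseness = min(1.0, target_len / max(length, 1))

    return alpha * similarity + (1 - alpha) * conciseness
```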
Analyzing the Evolution of Scientific Misconduct Based on the Language of Retracted Papers
Christof Bless | Andreas Waldis | Angelina Parfenova | Maria A. Rodriguez | Andreas Marfurt
Proceedings of the Fifth Workshop on Scholarly Document Processing (SDP 2025)
Amid rising numbers of organizations producing counterfeit scholarly articles, it is important to quantify the prevalence of scientific misconduct. We assess the feasibility of automated text-based methods to determine the rate of scientific misconduct by analyzing linguistic differences between retracted and non-retracted papers. We find that retracted works show distinct phrase patterns and higher word repetition. Motivated by this, we evaluate two misconduct detection methods, a mixture distribution approach and a Transformer-based one. The best models achieve high accuracy (>0.9 F1) on detection of paper mill articles and automatically generated content, making them viable tools for flagging papers for closer review. We apply the classifiers to more than 300,000 paper abstracts to quantify misconduct over time and find that our estimation methods accurately reproduce trends observed in the real data.
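As a rough illustration of the Transformer-based detection route, a binary abstract classifier can be fine-tuned as sketched below. The backbone model, toy data, and hyperparameters are assumptions for illustration, not the authors' setup:

```python
# Minimal sketch: fine-tune a Transformer to label abstracts as retracted (1)
# or non-retracted (0). Backbone and hyperparameters are illustrative only.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL = "distilbert-base-uncased"  # assumed backbone
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

# Hypothetical training data: paper abstracts with binary retraction labels.
train = Dataset.from_dict({
    "text": ["We propose a novel framework for ...",
             "In this study we demonstrate that ..."],
    "label": [0, 1],
}).map(lambda batch: tokenizer(batch["text"], truncation=True,
                               padding="max_length", max_length=256),
       batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="retraction-clf",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train,
)
trainer.train()
predictions = trainer.predict(train)  # use a held-out test split in practice
```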
2024
Automating Qualitative Data Analysis with Large Language Models
Angelina Parfenova | Alexander Denzler | Jürgen Pfeffer
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)
This PhD proposal aims to investigate ways of automating qualitative data analysis, specifically the thematic coding of texts. Although existing methods are widely covered in the literature, they mainly rely on Topic Modeling and other quantitative approaches whose results fall far short of resembling a human analyst's outcome. This proposal examines the limitations of current research in the field and proposes a novel methodology based on Large Language Models to tackle automated coding and bring it as close as possible to the results of human researchers. The paper covers studies already done in this field and their limitations, existing software, the problem of replicating researcher bias, and the proposed methodology.