2025
Digital Gatekeepers: Google’s Role in Curating Hashtags and Subreddits
Amrit Poudel | Yifan Ding | Tim Weninger | Jürgen Pfeffer
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Search engines play a crucial role as digital gatekeepers, shaping the visibility of Web and social media content through algorithmic curation. This study investigates how search engines like Google selectively promote or suppress certain hashtags and subreddits, impacting the information users encounter. By comparing search engine results with non-sampled data from Reddit and Twitter/X, we reveal systematic biases in content visibility. Google’s algorithms tend to suppress subreddits and hashtags related to sexually explicit material, conspiracy theories, advertisements, and cryptocurrencies, while promoting content associated with higher engagement. These findings suggest that Google’s gatekeeping practices influence public discourse by curating the social media narratives available to users.
Text Annotation via Inductive Coding: Comparing Human Experts to LLMs in Qualitative Data Analysis
Angelina Parfenova | Andreas Marfurt | Jürgen Pfeffer | Alexander Denzler
Findings of the Association for Computational Linguistics: NAACL 2025
This paper investigates the automation of qualitative data analysis, focusing on inductive coding using large language models (LLMs). Unlike traditional approaches that rely on deductive methods with predefined labels, this research investigates the inductive process where labels emerge from the data. The study evaluates the performance of six open-source LLMs compared to human experts. As part of the evaluation, experts rated the perceived difficulty of the quotes they coded. The results reveal a peculiar dichotomy: human coders consistently perform well when labeling complex sentences but struggle with simpler ones, while LLMs exhibit the opposite trend. Additionally, the study explores systematic deviations in both human and LLM-generated labels by comparing them to the gold standard from the test set. While human annotations may sometimes differ from the gold standard, they are often rated more favorably by other humans. In contrast, some LLMs demonstrate closer alignment with the true labels but receive lower evaluations from experts.
Measuring What Matters: Evaluating Ensemble LLMs with Label Refinement in Inductive Coding
Angelina Parfenova | Jürgen Pfeffer
Findings of the Association for Computational Linguistics: ACL 2025
Inductive coding traditionally relies on labor-intensive human coding, which is prone to inconsistencies and individual biases. Although large language models (LLMs) offer promising automation capabilities, their standalone use often results in inconsistent outputs, limiting their reliability. In this work, we propose a framework that combines ensemble methods with a code-refinement methodology to address these challenges. Our approach integrates multiple smaller LLMs, fine-tuned via Low-Rank Adaptation (LoRA), and employs a moderator-based mechanism to simulate human consensus. To address the limitations of metrics like ROUGE and BERTScore, we introduce a composite evaluation metric that combines code conciseness and contextual similarity. The validity of this metric is confirmed through correlation analysis with human expert ratings. Results demonstrate that smaller ensemble models with refined outputs consistently outperform other ensembles, individual models, and even large-scale LLMs like GPT-4. Our evidence suggests that smaller ensemble models significantly outperform larger standalone language models, underscoring the risk of relying solely on a single large model for qualitative analysis.
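The abstract does not spell out the composite metric, so the following is only a minimal sketch of the idea: blend an embedding-based contextual-similarity term with a conciseness term. The embedding model, the capped length-ratio conciseness term, and the weight alpha are illustrative assumptions, not the paper's formulation; here a "code" is a qualitative label, not program code.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def composite_score(candidate: str, reference: str, alpha: float = 0.5) -> float:
    """Blend contextual similarity with a simple conciseness term."""
    # Contextual similarity: cosine similarity of sentence embeddings.
    emb = model.encode([candidate, reference], convert_to_tensor=True)
    similarity = float(util.cos_sim(emb[0], emb[1]))
    # Conciseness: reward candidate codes no longer than the reference
    # (length ratio capped at 1); an assumed stand-in for the paper's term.
    conciseness = min(1.0, len(reference.split()) / max(1, len(candidate.split())))
    return alpha * similarity + (1 - alpha) * conciseness

print(composite_score("fear of job loss", "anxiety about job security"))
```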
2024
SPIN: Sparsifying and Integrating Internal Neurons in Large Language Models for Text Classification
Difan Jiao | Yilun Liu | Zhenwei Tang | Daniel Matter | Jürgen Pfeffer | Ashton Anderson
Findings of the Association for Computational Linguistics: ACL 2024
Among the many tasks that Large Language Models (LLMs) have revolutionized is text classification. Current text classification paradigms, however, rely solely on the output of the final layer in the LLM, leaving the rich information contained in internal neurons largely untapped. In this study, we present SPIN: a model-agnostic framework that sparsifies and integrates internal neurons of intermediate layers of LLMs for text classification. Specifically, SPIN sparsifies internal neurons layer by layer through linear-probing-based salient neuron selection, avoiding noise from unrelated neurons and ensuring efficiency. The cross-layer salient neurons are then integrated to serve as multi-layered features for the classification head. Extensive experimental results show that our proposed SPIN significantly improves text classification accuracy, efficiency, and interpretability.
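A rough illustration of the recipe the abstract describes: probe each layer's hidden states with a linear model, keep the most salient neurons per layer, and concatenate them as features for the final classifier. The backbone (bert-base-uncased), mean pooling, scikit-learn probes, and top_k are assumptions for the sketch, not the paper's exact setup.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed backbone
enc = AutoModel.from_pretrained("bert-base-uncased")

def layer_features(texts):
    """Mean-pooled hidden states for every layer: a list of (n, d) arrays."""
    with torch.no_grad():
        batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
        hidden = enc(**batch, output_hidden_states=True).hidden_states
    return [h.mean(dim=1).numpy() for h in hidden]

def spin_features(texts, labels, top_k=64):
    """Probe each layer, keep the top_k neurons by |probe weight|, concatenate."""
    kept = []
    for feats in layer_features(texts):
        probe = LogisticRegression(max_iter=1000).fit(feats, labels)
        salient = np.argsort(-np.abs(probe.coef_).max(axis=0))[:top_k]
        kept.append(feats[:, salient])
    return np.hstack(kept)  # cross-layer features for the classification head
```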
The Language of Trauma: Modeling Traumatic Event Descriptions Across Domains with Explainable AI
Miriam Schirmer | Tobias Leemann | Gjergji Kasneci | Jürgen Pfeffer | David Jurgens
Findings of the Association for Computational Linguistics: EMNLP 2024
Psychological trauma can manifest following various distressing events and is captured in diverse online contexts. However, studies traditionally focus on a single aspect of trauma, often neglecting the transferability of findings across different scenarios. We address this gap by training language models of increasing complexity on trauma-related datasets, including genocide-related court data, a Reddit dataset on post-traumatic stress disorder (PTSD), counseling conversations, and Incel forum posts. Our results show that the fine-tuned RoBERTa model excels in predicting traumatic events across domains, slightly outperforming large language models like GPT-4. Additionally, SLALOM feature scores and conceptual explanations effectively differentiate and cluster trauma-related language, highlighting different trauma aspects and identifying sexual abuse and experiences related to death as traumatic events common to all datasets. This transferability is crucial, as it allows for the development of tools to enhance trauma detection and intervention in diverse populations and settings.
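A minimal sketch of the fine-tuning setup the abstract names, using roberta-base with Hugging Face's Trainer; the placeholder rows, binary label scheme, and hyperparameters are assumptions, not the paper's settings.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base",
                                                           num_labels=2)

# Placeholder rows; the paper's corpora (court data, the PTSD subreddit,
# counseling conversations, Incel posts) would be loaded here instead.
train = Dataset.from_dict({
    "text": ["I still have nightmares about it.", "Great weather today."],
    "label": [1, 0],  # 1 = traumatic event description (assumed scheme)
}).map(lambda b: tok(b["text"], truncation=True, padding="max_length",
                     max_length=128), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="trauma-roberta", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=train,
)
trainer.train()
```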
2022
Hate Speech Classification in Bulgarian
Radoslav Ralev | Jürgen Pfeffer
Proceedings of the Fifth International Conference on Computational Linguistics in Bulgaria (CLIB 2022)
In recent years, we have seen a surge in the propagation of online hate speech on social media platforms. According to a multitude of sources, such as the European Council, hate speech can lead to acts of violence and conflict on a broader scale. This has led to increased awareness by governments, companies, and the scientific community, and although the field is relatively new, there have been considerable advancements as a result of the collective effort. Despite the increasingly better results, most research focuses on the more popular languages (e.g., English, German, or Arabic), whereas less popular languages such as Bulgarian and other Balkan languages have been neglected. We aggregated a real-world dataset from Bulgarian online forums and manually annotated 108,142 sentences, about 1.74% of which can be described by the categories racism, sexism, rudeness, and profanity. We then developed and evaluated various classifiers on the dataset and found that a support vector machine with a linear kernel trained on character-level TF-IDF features is the best model. Our work can be seen as another piece of the puzzle in building a strong foundation for future work on hate speech classification in Bulgarian.
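The best-performing configuration translates almost directly into scikit-learn; the character n-gram range and regularization strength below are assumptions, since the abstract does not report them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

clf = Pipeline([
    # Character-level TF-IDF; "char_wb" n-grams stay within word boundaries.
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5))),
    # Linear-kernel SVM, the paper's best model.
    ("svm", LinearSVC(C=1.0)),
])
# clf.fit(sentences, labels)    # sentences: Bulgarian forum text
# clf.predict(new_sentences)    # labels: racism / sexism / rudeness / profanity / none
```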
2014
Finding Eyewitness Tweets During Crises
Fred Morstatter | Nichola Lubold | Heather Pon-Barry | Jürgen Pfeffer | Huan Liu
Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science