Julian Schelb


2026

The use of large language models (LLMs) to generate responses to multiple-choice-style questionnaires originally intended for human respondents is often a helpful or even necessary task, for example in persona simulation or during LLM alignment. Although the input and output versatility of generative LLMs is beneficial when adapting such questionnaires to machine use, it can be detrimental when mapping the generated text back to a closed set of answer options for evaluation or scoring. In this paper, we investigate how well smaller models classify LLM outputs into the available answer options of multiple-choice questionnaires. We consider fine-tuned encoder transformers as well as a rule-based approach on three datasets of differing answer-option complexity. Surprisingly, we find that even the best-performing neural approach underperforms our rule-based baseline, indicating that simple pattern matching of answer options against LLM outputs may still be the most competitive solution for cleaning LLM responses to multiple-choice questionnaires.

2024

Researchers in the political and social sciences often rely on classification models to analyze trends in information consumption, examining browsing histories that span millions of webpages. Because manual labeling is impractical at this scale, automated, scalable methods are necessary. In this paper, we model the detection of topic-related content as a binary classification task and compare the accuracy of fine-tuned pre-trained encoder models against in-context learning strategies. Using only a few hundred annotated data points per topic, we detect content related to three German policies in a database of scraped webpages. We compare multilingual and monolingual models as well as zero- and few-shot approaches, and investigate the impact of negative sampling strategies and of combining URL- and content-based features. Our results show that a small sample of annotated data is sufficient to train an effective classifier and that fine-tuning encoder-based models yields better results than in-context learning. Classifiers using both URL- and content-based features perform best, while URLs alone provide adequate results when content is unavailable.
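One way to combine URL- and content-based features, as the abstract describes, is to serialize both into a single input string for the encoder. The sketch below is a hypothetical illustration of that idea, assuming a BERT-style `[SEP]` separator; the function name, tokenization rules, and example URL are not taken from the paper.

```python
import re

def build_input(url: str, content: str, max_content_chars: int = 500) -> str:
    """Combine URL tokens and page content into one classifier input.
    The URL is split into domain/path tokens so the model can use its
    signal even when the page content is thin or missing."""
    url_tokens = re.split(r"[/.\-_?=&:]+", url.lower())
    stop = {"https", "http", "www", "html"}
    url_part = " ".join(t for t in url_tokens if t and t not in stop)
    return f"{url_part} [SEP] {content[:max_content_chars]}"

example = build_input(
    "https://www.example.de/politik/cannabis-legalisierung-2024.html",
    "Der Bundestag hat über die Legalisierung von Cannabis debattiert.",
)
print(example)
```

When content is unavailable, the same function degrades gracefully to a URL-only input, which mirrors the abstract's finding that URLs alone can provide adequate results.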