Julian Rodemann


2025

Towards Better Open-Ended Text Generation: A Multicriteria Evaluation Framework
Esteban Garces Arias | Hannah Blocher | Julian Rodemann | Meimingwei Li | Christian Heumann | Matthias Aßenmacher
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)

Open-ended text generation has become a prominent task in natural language processing due to the rise of powerful (large) language models. However, evaluating the quality of these models and the employed decoding strategies remains challenging due to trade-offs among widely used metrics such as coherence, diversity, and perplexity. This paper addresses the specific problem of multicriteria evaluation for open-ended text generation, proposing novel methods for both relative and absolute rankings of decoding methods. Specifically, we employ benchmarking approaches based on partial orderings and present a new summary metric to balance existing automatic indicators, providing a more holistic evaluation of text generation quality. Our experiments demonstrate that the proposed approaches offer a robust way to compare decoding strategies and serve as valuable tools to guide model selection for open-ended text generation tasks. We suggest future directions for improving evaluation methodologies in text generation and make our code, datasets, and models publicly available.
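The following is a minimal, illustrative sketch of the general idea of comparing decoding strategies under multiple metrics: a Pareto-style partial order over strategies plus a simple weighted aggregate. It is not the paper's proposed benchmarking method or summary metric; the metric values, names, and weights are hypothetical.

```python
# Illustrative sketch only: Pareto dominance (a partial order) and a toy
# aggregate score over per-strategy automatic metrics. NOT the paper's metric.

from dataclasses import dataclass

@dataclass
class Scores:
    coherence: float   # higher is better
    diversity: float   # higher is better
    perplexity: float  # lower is better

def dominates(a: Scores, b: Scores) -> bool:
    """a Pareto-dominates b: at least as good on every metric, strictly better on one."""
    ge = (a.coherence >= b.coherence and a.diversity >= b.diversity
          and a.perplexity <= b.perplexity)
    gt = (a.coherence > b.coherence or a.diversity > b.diversity
          or a.perplexity < b.perplexity)
    return ge and gt

def summary_score(s: Scores, w=(1.0, 1.0, 0.05)) -> float:
    """Toy aggregate: weighted sum of metrics, with perplexity penalized."""
    return w[0] * s.coherence + w[1] * s.diversity - w[2] * s.perplexity

# Hypothetical metric values for three decoding strategies.
strategies = {
    "beam": Scores(0.71, 0.35, 12.4),
    "nucleus": Scores(0.63, 0.58, 18.9),
    "contrastive": Scores(0.69, 0.55, 14.1),
}
for a in strategies:
    for b in strategies:
        if a != b and dominates(strategies[a], strategies[b]):
            print(f"{a} Pareto-dominates {b}")
    print(f"{a}: summary score = {summary_score(strategies[a]):.3f}")
```

Note that Pareto dominance only yields a partial ranking (some pairs of strategies remain incomparable), which is why a separate summary metric is useful when a single ordering is required.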

The Geometry of Creative Variability: How Credal Sets Expose Calibration Gaps in Language Models
Esteban Garces Arias | Julian Rodemann | Christian Heumann
Proceedings of the 2nd Workshop on Uncertainty-Aware NLP (UncertaiNLP 2025)

Understanding uncertainty in large language models remains a fundamental challenge, particularly in creative tasks where multiple valid outputs exist. We present a geometric framework using credal sets—convex hulls of probability distributions—to quantify and decompose uncertainty in neural text generation, calibrated against human creative variation. Analyzing a dataset of 500 creative writing prompts, each with 10 unique human continuations, we evaluate four language models across five decoding strategies, generating 100,000 stories. Our credal set analysis reveals substantial gaps in capturing human creative variation, with the best model-human calibration reaching only 0.434 (Gemma-2B with temperature 0.7). We decompose total uncertainty into epistemic and aleatoric components, finding that the choice of decoding strategy contributes 39.4% to 72.0% of total epistemic uncertainty. Model scale shows only a weak correlation with calibration quality, and base and instruction-tuned models do not differ significantly in this respect. Our geometric framework provides actionable insights for improving generation systems for human-AI creative alignment. We release our complete experimental framework at https://github.com/EstebanGarces/uncertainHuman.
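As a rough illustration of the epistemic/aleatoric split mentioned above, the sketch below applies a common entropy-based decomposition to a small set of next-token distributions (e.g., one per decoding strategy). It does not reproduce the paper's credal-set construction or its calibration measure; the distributions are made up.

```python
# Illustrative sketch only: entropy-based decomposition of total uncertainty
# into aleatoric (average member entropy) and epistemic (disagreement) parts.

import numpy as np

def entropy(p: np.ndarray) -> float:
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Rows: hypothetical member distributions over 4 candidate tokens,
# e.g. obtained from different decoding strategies at one generation step.
members = np.array([
    [0.70, 0.20, 0.05, 0.05],
    [0.40, 0.40, 0.10, 0.10],
    [0.55, 0.25, 0.15, 0.05],
])

mean_dist = members.mean(axis=0)
total = entropy(mean_dist)                           # uncertainty of the mixture
aleatoric = np.mean([entropy(p) for p in members])   # average inherent randomness
epistemic = total - aleatoric                        # disagreement between members

print(f"total={total:.3f}  aleatoric={aleatoric:.3f}  epistemic={epistemic:.3f}")
```

In this decomposition, epistemic uncertainty is zero only when all member distributions coincide, so the spread of distributions induced by different decoding strategies shows up directly in the epistemic term.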

2024

Adaptive Contrastive Search: Uncertainty-Guided Decoding for Open-Ended Text Generation
Esteban Garces Arias | Julian Rodemann | Meimingwei Li | Christian Heumann | Matthias Aßenmacher
Findings of the Association for Computational Linguistics: EMNLP 2024

Despite the remarkable capabilities of large language models, generating high-quality text remains a challenging task. Numerous decoding strategies—such as beam search, sampling with temperature, top‐k sampling, nucleus (top‐p) sampling, typical decoding, contrastive decoding, and contrastive search—have been proposed to address these challenges by improving coherence, diversity, and resemblance to human-generated text. In this study, we introduce Adaptive Contrastive Search (ACS), a novel decoding strategy that extends contrastive search (CS) by incorporating an adaptive degeneration penalty informed by the model’s estimated uncertainty at each generation step. ACS aims to enhance creativity and diversity while maintaining coherence to produce high-quality outputs. Extensive experiments across various model architectures, languages, and datasets demonstrate that our approach improves both creativity and coherence, underscoring its effectiveness in text-generation tasks. We release our code, datasets, and models to facilitate further research.
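To make the idea of an uncertainty-dependent degeneration penalty concrete, here is a toy sketch of contrastive-search-style candidate scoring where the penalty weight is scaled by the predictive entropy of the model's next-token distribution. It uses random vectors in place of real hidden states, and the particular adaptation rule is purely illustrative, not the ACS algorithm from the paper.

```python
# Illustrative sketch only: contrastive-search-style scoring with an
# entropy-scaled degeneration penalty. Candidate probabilities and hidden
# states are hypothetical; this is not the paper's exact ACS rule.

import numpy as np

rng = np.random.default_rng(0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def entropy(p: np.ndarray) -> float:
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Hypothetical setup: 5 candidate tokens with model probabilities and
# hidden-state representations; the context has 4 previous hidden states.
probs = np.array([0.45, 0.25, 0.15, 0.10, 0.05])
cand_h = rng.normal(size=(5, 16))
ctx_h = rng.normal(size=(4, 16))

# Adaptive penalty weight: higher normalized entropy (more model uncertainty)
# leads to a stronger degeneration penalty (one plausible rule, for illustration).
alpha = min(1.0, entropy(probs) / np.log(len(probs)))

scores = []
for p, h in zip(probs, cand_h):
    max_sim = max(cosine(h, c) for c in ctx_h)   # similarity to prior context
    scores.append((1 - alpha) * p - alpha * max_sim)

best = int(np.argmax(scores))
print(f"alpha={alpha:.2f}, selected candidate index={best}")
```

With alpha fixed, this reduces to the standard contrastive search trade-off between model confidence and repetition of the context; making alpha depend on the step-wise uncertainty is the adaptive element the abstract describes.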