Andrew Hoblitzell


2026

This paper describes the BAHAHA system for SemEval-2026 Task 1: MWAHAHA, which requires generating original jokes given either a news headline or a pair of rare words. Our approach uses a generate-then-rank pipeline that combines multi-style candidate generation via comedian-inspired few-shot prompting with quality assessment by a smaller model fine-tuned on synthetic rating data from the generation model. Specifically, we produce up to 50 candidates per input across 15 stylistic templates and select outputs through a mixed-initiative interface that combines automated ranking with human judgment. The contest drew 305 participants and 180 submissions. Our system ranks 2nd on Subtask A Chinese and 5th on Subtasks B1 and B2. The system generates jokes natively in each language rather than through translation.
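The generate-then-rank idea in the abstract above can be illustrated with a minimal sketch. It is not the BAHAHA implementation: the three templates, the `stub_generate` and `stub_score` functions, and the `top_k` parameter are hypothetical placeholders standing in for the 15 stylistic templates, the generation model, and the fine-tuned rating model, none of which are specified in the abstract.

```python
from typing import Callable, List

# Hypothetical style templates; the abstract mentions 15 but does not list them.
STYLE_TEMPLATES = [
    "Write a deadpan one-liner about: {prompt}",
    "Write an observational joke about: {prompt}",
    "Write an absurdist joke about: {prompt}",
]


def generate_candidates(
    prompt: str,
    generate: Callable[[str, int], str],
    max_candidates: int = 50,
) -> List[str]:
    """Cycle through the style templates until max_candidates are produced."""
    candidates = []
    for i in range(max_candidates):
        template = STYLE_TEMPLATES[i % len(STYLE_TEMPLATES)]
        candidates.append(generate(template.format(prompt=prompt), i))
    return candidates


def rank_candidates(
    candidates: List[str],
    score: Callable[[str], float],
    top_k: int = 5,
) -> List[str]:
    """Drop exact duplicates, then return the top_k highest-scoring candidates."""
    unique = list(dict.fromkeys(candidates))  # preserves first-seen order
    return sorted(unique, key=score, reverse=True)[:top_k]


def stub_generate(filled_prompt: str, sample_index: int) -> str:
    """Placeholder for a call to the generation model (assumption)."""
    return f"{filled_prompt} (sample {sample_index})"


def stub_score(text: str) -> float:
    """Placeholder for the fine-tuned rating model (assumption); arbitrary score."""
    return float(len(text) % 7)


candidates = generate_candidates("cat wins local election", stub_generate)
shortlist = rank_candidates(candidates, stub_score, top_k=5)
```

In a mixed-initiative setup such as the one the abstract describes, a shortlist like this would be handed to a human for the final selection rather than submitted automatically.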
Current hallucination detection systems operate under a flawed assumption: that model outputs deviating from factual grounding are uniformly problematic regardless of task context, modality, or cultural setting. Through analysis of computational humor as a motivating case study, we demonstrate that identical model behaviors require radically different evaluations depending on context. We propose reframing hallucination detection as task-output alignment assessment, introducing a three-dimensional framework spanning factual grounding requirements, novelty requirements, and risk tolerance.
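To make the three dimensions concrete, the following is a small sketch of how one identical output could be assessed differently under two task profiles. Everything here is an assumption made for illustration: the `TaskProfile` fields, the `alignment_score` formula, and all numeric values are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskProfile:
    """Hypothetical task description along the three dimensions, each in [0, 1]."""
    name: str
    grounding_required: float  # how much factual grounding the task demands
    novelty_required: float    # how much novelty the task demands
    risk_tolerance: float      # how much deviation the setting can tolerate


def alignment_score(profile: TaskProfile, groundedness: float, novelty: float) -> float:
    """Illustrative score in [0, 1]; shortfalls are penalised more when risk tolerance is low."""
    grounding_gap = max(0.0, profile.grounding_required - groundedness)
    novelty_gap = max(0.0, profile.novelty_required - novelty)
    penalty = (grounding_gap + novelty_gap) * (1.0 - profile.risk_tolerance)
    return max(0.0, 1.0 - penalty)


# Two made-up profiles: humor tolerates ungrounded content, clinical text does not.
joke_task = TaskProfile("joke generation", 0.1, 0.9, 0.8)
clinical_task = TaskProfile("clinical summary", 0.95, 0.0, 0.05)

# The same output (weakly grounded, highly novel) assessed under both profiles.
joke_score = alignment_score(joke_task, groundedness=0.2, novelty=0.9)
clinical_score = alignment_score(clinical_task, groundedness=0.2, novelty=0.9)
```

Under these made-up numbers the same output is fully aligned with the joke task and poorly aligned with the clinical task, which is the kind of context dependence the abstract argues for.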

2024

Hallucinations in large language models (LLMs) have recently become a significant problem. A recent effort in this direction is a shared task at SemEval 2024 Task 6, SHROOM, a Shared-task on Hallucinations and Related Observable Overgeneration Mistakes. This paper describes our winning solution ranked 1st and 2nd in the 2 sub-tasks of model agnostic and model aware tracks respectively. We propose a meta-regressor based ensemble of LLMs based on a random forest algorithm that achieves the highest scores on the leaderboard. We also experiment with various transformer based models and black box methods like ChatGPT, Vectara, and others. In addition, we perform an error analysis comparing ChatGPT against our best model which shows the limitations of the former.