Does Bigger Mean Funnier? Evaluating Humor Generation Across the Qwen3 Model Family

Jatin Agrawal, Radhika Mamidi


Abstract
We investigate whether scaling model parameters improves humor generation through a controlled ablation study. Using five Qwen3 variants (8B–235B, dense and MoE), we generate jokes across 50 themes. Beyond evaluating humor scaling, this work serves as an empirical study of how LLM and human evaluations differ on highly subjective creative tasks. While an automated judge yields a perfectly monotonic relationship between parameter count and win rate, human annotators find no significant aggregate difference in humor quality. Restricting to themes where annotators agree reveals a significant preference for the largest model (p = 0.039), suggesting that scaling effects exist but are masked by a "quality floor." Crucially, our analysis of bias characteristics shows that the automated judge exhibits severe positional and length biases compared to human evaluators, further suggesting that LLM judges may systematically distort quality differences on subjective tasks.
Anthology ID:
2026.chum-1.7
Volume:
Proceedings of the 2nd Workshop on Computational Humor (CHum 2026)
Month:
July
Year:
2026
Address:
Online
Editors:
Ori Amir, Christian F. Hempelmann, Julia Rayz, Tiansi Dong, Tristan Miller
Venues:
chum | WS
Publisher:
Association for Computational Linguistics
Pages:
81–94
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.chum-1.7/
Cite (ACL):
Jatin Agrawal and Radhika Mamidi. 2026. Does Bigger Mean Funnier? Evaluating Humor Generation Across the Qwen3 Model Family. In Proceedings of the 2nd Workshop on Computational Humor (CHum 2026), pages 81–94, Online. Association for Computational Linguistics.
Cite (Informal):
Does Bigger Mean Funnier? Evaluating Humor Generation Across the Qwen3 Model Family (Agrawal & Mamidi, CHum 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.chum-1.7.pdf