v-HUB: A Benchmark for Video Humor Understanding from Vision and Sound

Zhengpeng Shi; Yanpeng Zhao; Jianqun Zhou; Yuxuan Wang; Qinrong Cui; Wei Bi; Song-Chun Zhu; Bo Zhao; Zilong Zheng

Abstract

AI models capable of comprehending humor hold real-world promise—for example, enhancing engagement in human-machine interactions. To gauge and diagnose the capacity of multimodal large language models (MLLMs) for humor understanding, we introduce v-HUB, a novel video humor understanding benchmark. v-HUB comprises a curated collection of non-verbal short videos, reflecting real-world scenarios where humor can be appreciated purely through visual cues. We pair each video clip with rich annotations to support a variety of evaluation tasks and analyses, including a novel study of environmental sound that can enhance humor. To broaden its applicability, we construct an open-ended QA task, making v-HUB readily integrable into existing video understanding task suites. We evaluate a diverse set of MLLMs, from specialized Video-LLMs to versatile OmniLLMs that can natively process audio, covering both open-source and proprietary domains. The experimental results expose the difficulties MLLMs face in comprehending humor from visual cues alone. Our findings also demonstrate that incorporating audio helps with video humor understanding, highlighting the promise of integrating richer modalities for complex video understanding tasks.

Anthology ID: 2026.acl-long.1785
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 38544–38567
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1785/
Cite (ACL): Zhengpeng Shi, Yanpeng Zhao, Jianqun Zhou, Yuxuan Wang, Qinrong Cui, Wei Bi, Song-Chun Zhu, Bo Zhao, and Zilong Zheng. 2026. v-HUB: A Benchmark for Video Humor Understanding from Vision and Sound. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 38544–38567, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): v-HUB: A Benchmark for Video Humor Understanding from Vision and Sound (Shi et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1785.pdf
Checklist: 2026.acl-long.1785.checklist.pdf
