Rishabh Sabharwal

2026

Verbal humor involves reasoning through complex conversational contexts. Although LLMs have achieved strong performance on English humor datasets, their ability to interpret humor in Hindi remains unexplored. In this paper, we evaluate LLMs on Hindi humor using dialogues extracted from humorous video clips. We use a pipeline that transforms video content into detailed textual streams, including dialogue transcripts and scene descriptions, allowing reasoning over inputs exceeding 2,000 words. We test a range of LLMs, from efficient edge models (Qwen-2.5-7B, Qwen-3-7B, Gemma-3-27B) to Indic-focused models (Sarvam-M-24B) and large frontier models (Llama-3.1-70B, Gemini-2.0-Flash). Our findings show a concave performance pattern in long-context understanding, with reasoning quality peaking at moderate lengths (250–750 words) and declining at longer context lengths. We also show that standard metrics overstate pragmatic competence. While increasing model size generally improves performance, we observe distinct failures in smaller LLMs due to instructional and linguistic issues, necessitating diversity metrics to capture hallucinations. Smaller, Hindi-focused models can compete with much larger generalist models. Importantly, our evaluation reveals that conversational humor remains a challenge even for specialized models, making HinS a valuable benchmark for advancing research in Hindi Long-Context Humor Reasoning.

Co-authors

Punit Rathore 1

Navya Shrivastava 1

Venues

C3NLP 1
WS 1
