Kai Han

Other people with similar names: Kai Han


2025

pdf bib
GAMEBoT: Transparent Assessment of LLM Reasoning in Games
Wenye Lin | Jonathan Roberts | Yunhan Yang | Samuel Albanie | Zongqing Lu | Kai Han
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large Language Models (LLMs) are increasingly deployed in real-world applications that demand complex reasoning. To track progress, robust benchmarks are required to evaluate their capabilities beyond superficial pattern recognition. However, current LLM reasoning benchmarks often face challenges such as insufficient interpretability, performance saturation or data contamination. To address these challenges, we introduce GAMEBoT, a gaming arena designed for rigorous and transparent assessment of LLM reasoning capabilities. GAMEBoT decompose complex reasoning in games into predefined modular subproblems. This decomposition allows us to design a suite of Chain-of-Thought (CoT) prompts infused with domain knowledge to guide LLMs in addressing these subproblems before action selection. Furthermore, we develop a suite of rule-based algorithms to generate ground truth for these subproblems, enabling rigorous validation of the LLMs’ intermediate reasoning steps. This approach facilitates evaluation of both the quality of final actions and the accuracy of the underlying reasoning process. GAMEBoT also naturally alleviates the risk of data contamination through dynamic games and head-to-head LLM competitions. We benchmark 17 prominent LLMs across eight games, encompassing various strategic abilities and game characteristics. Our results suggest that GAMEBoT presents a significant challenge, even when LLMs are provided with detailed CoT prompts.

pdf bib
PruneVid: Visual Token Pruning for Efficient Video Large Language Models
Xiaohu Huang | Hao Zhou | Kai Han
Findings of the Association for Computational Linguistics: ACL 2025

We introduce PruneVid, a training-free visual token pruning method designed to enhance the efficiency of multimodal video understanding. While Large Language Models (LLMs) have shown promising performance on video tasks due to their advanced visual comprehension capabilities, the substantial redundancy inherent in video data poses significant computational challenges. To address this issue, PruneVid (1) reduces intrinsic video redundancy by merging temporally static and spatially similar tokens, and (2) leverages LLMs’ inherent ability to selectively prune visual tokens irrelevant to specific queries, thereby improving model efficiency. We validate our method across multiple video benchmarks, demonstrating that PruneVid can prune over 80% of tokens while maintaining competitive performance when combined with different video LLMs. Our results highlight PruneVid’s superior effectiveness and efficiency compared to existing pruning methods.