Haibo Tong
2026
CogToM: A Comprehensive Theory of Mind Benchmark inspired by Human Cognition for Large Language Models
Haibo Tong | Zeyang Yue | Feifei Zhao | Erliang Lin | Lu Jia | Ruolin Chen | Yinqian Sun | Qian Zhang | Yi Zeng
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Whether Large Language Models (LLMs) truly possess human-like Theory of Mind (ToM) capabilities has garnered increasing attention. However, existing benchmarks remain largely restricted to narrow paradigms like false belief tasks, failing to capture the full spectrum of human cognitive mechanisms. We introduce **CogToM**, a comprehensive, theoretically grounded benchmark comprising over 8000 bilingual instances across 46 paradigms, validated by 49 human annotators. A systematic evaluation of 22 representative models, including frontier models like GPT-5.1 and Qwen3-Max, reveals significant performance heterogeneities and highlights persistent bottlenecks in specific dimensions. Further analysis based on human cognitive patterns suggests potential divergences between LLM and human cognitive structures. CogToM offers a robust instrument and perspective for investigating the evolving cognitive boundaries of LLMs. We release our code and data at https://github.com/Beijing-AISI/CogToM.
2025
GLIMPSE: Do Large Vision-Language Models Truly Think With Videos or Just Glimpse at Them?
Yiyang Zhou | Linjie Li | Shi Qiu | Zhengyuan Yang | Yuyang Zhao | Siwei Han | Yangfan He | Kangqi Li | Haonian Ji | Zihao Zhao | Haibo Tong | Lijuan Wang | Huaxiu Yao
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Existing video benchmarks often resemble image-based benchmarks, with question types like “What actions does the person perform throughout the video?” or “What color is the woman’s dress in the video?” For these, models can often answer by scanning just a few key frames, without deep temporal reasoning. This limits our ability to assess whether large vision-language models (LVLMs) can truly think with videos rather than perform superficial frame-level analysis. To address this, we introduce GLIMPSE, a benchmark specifically designed to evaluate whether LVLMs can genuinely think with videos. Unlike prior benchmarks, GLIMPSE emphasizes comprehensive video understanding beyond static image cues. It consists of 3,269 videos and over 4,342 highly visual-centric questions across 11 categories, including Trajectory Analysis, Temporal Reasoning, and Forensics Detection. All questions are carefully crafted by human annotators and require watching the entire video and reasoning over the full video context; this is what we mean by thinking with video. These questions cannot be answered by scanning selected frames or relying on text alone. In human evaluations, GLIMPSE achieves 94.82% accuracy, but current LVLMs face significant challenges. Even the best-performing model, GPT-o3, reaches only 66.43%, highlighting that LVLMs still struggle to move beyond surface-level reasoning to truly think with videos. We publicly release our benchmark and code at https://github.com/aiming-lab/GLIMPSE.
A Lightweight Multi Aspect Controlled Text Generation Solution For Large Language Models
Chenyang Zhang | Jiayi Lin | Haibo Tong | Bingxuan Hou | Dongyu Zhang | Jialin Li | Junli Wang
Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025)
Multi-Aspect Controllable Text Generation (MCTG) introduces multiple fine-grained constraints into natural language generation, e.g., control attributes for topic, sentiment, and detoxification. MCTG shows promise for trustworthy generation with Large Language Models (LLMs) but is limited by generalization issues. Existing work relies on additional structures and strategies, which require modifications to the LLMs. To activate LLMs’ MCTG ability, we propose a lightweight MCTG pipeline based on data augmentation and instruction tuning. We analyze aspect bias and correlations in traditional datasets and address these concerns with augmented control attributes and sentences. The augmented datasets are suitable for instruction tuning. We conduct experiments on various LLM backbones and parameter sizes, demonstrating general effectiveness on MCTG performance.
Disentangle to Decay: Linear Attention with Trainable Decay Factor
Haibo Tong | Chenyang Zhang | Jiayi Lin | Bingxuan Hou | Qingqing Hong | Junli Wang
Proceedings of the 31st International Conference on Computational Linguistics
Linear attention improves the inference efficiency of Transformers and has attracted research interest as an efficient backbone for language models. Existing linear-attention-based models usually use decay-factor-based positional encoding (PE), where attention scores decay exponentially with increasing relative distance. However, most work manually designs a non-trainable decay factor for the exponential calculation, which limits further optimization. Our analysis reveals that directly training the decay factor is unstable because of large gradients. To address this, we propose a novel PE for linear attention named Disentangle to Decay (D2D). D2D disentangles the decay factor into two parts to achieve further optimization and stable training. Moreover, D2D can be transformed into a recurrent form for efficient inference. Experiments demonstrate that D2D achieves stable training of the decay factor and enhances the performance of linear attention in both normal context length and length extrapolation scenarios.
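
As a rough illustration of the background this abstract describes, the sketch below shows generic causal linear attention with a single fixed scalar decay factor, computed once in pairwise form and once in recurrent form. It is not the D2D method: the abstract does not specify how the decay factor is disentangled, so nothing here attempts that. The function names, the scalar `decay` parameter, and the unnormalized scoring are assumptions made for illustration only.

```python
def decay_attention_parallel(q, k, v, decay):
    """Causal linear attention with exponential decay, computed pairwise.

    q, k: lists of T vectors of length d; v: list of T vectors of length m.
    Output: o_t = sum over s <= t of decay**(t - s) * (q_t . k_s) * v_s.
    Illustrative only: a fixed scalar decay, no normalization.
    """
    m = len(v[0])
    out = []
    for t in range(len(q)):
        o = [0.0] * m
        for s in range(t + 1):
            # The score shrinks exponentially with the relative distance t - s.
            score = sum(a * b for a, b in zip(q[t], k[s])) * decay ** (t - s)
            for j in range(m):
                o[j] += score * v[s][j]
        out.append(o)
    return out


def decay_attention_recurrent(q, k, v, decay):
    """Same output via a running state S_t = decay * S_{t-1} + k_t v_t^T."""
    d, m = len(q[0]), len(v[0])
    state = [[0.0] * m for _ in range(d)]  # d x m matrix, updated per step
    out = []
    for q_t, k_t, v_t in zip(q, k, v):
        for i in range(d):
            for j in range(m):
                state[i][j] = decay * state[i][j] + k_t[i] * v_t[j]
        # o_t = q_t^T S_t
        out.append([sum(q_t[i] * state[i][j] for i in range(d)) for j in range(m)])
    return out


if __name__ == "__main__":
    q = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
    k = [[1.0, 2.0], [0.0, 1.0], [2.0, 0.5]]
    v = [[1.0], [2.0], [3.0]]
    print(decay_attention_parallel(q, k, v, 0.9))
    print(decay_attention_recurrent(q, k, v, 0.9))
```

The two functions agree up to floating-point error. The recurrent form keeps only a fixed-size state per step, which is the property that makes decay-based linear attention efficient at inference time.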