Haitao Li
2026
VideoPro: Adaptive Program Reasoning for Long Video Understanding
Chenglin Li | Feng Han | Yikun Wang | Ruilin Li | Shuai Dong | Haowen Hou | Haitao Li | Qianglong Chen | Feng Tao | Jingqi Tong | Yin Zhang | Jiaqi Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Chenglin Li | Feng Han | Yikun Wang | Ruilin Li | Shuai Dong | Haowen Hou | Haitao Li | Qianglong Chen | Feng Tao | Jingqi Tong | Yin Zhang | Jiaqi Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Understanding long videos remains challenging due to the sparsity of visual evidence relevant to a given query. Prior work has explored program-based visual grounding, typically relying on executable programs generated by auxiliary large language models. However, when scaling to long videos, existing approaches face several critical limitations: (1) frame-centric vision modules are often insufficient for long video processing; (2) naively applying program-based reasoning to all queries incurs considerable computational overhead; and (3) errors arising from low-confidence predictions and imperfect program execution are difficult to recover from. To address these challenges, we propose VideoPro, a unified framework that enables VideoLLMs to adaptively reason over long videos and refine their predictions through executable programs. VideoPro first performs adaptive reasoning, dynamically determining whether a query can be resolved directly by the native VideoLLM or requires explicit multi-step program reasoning. For complex queries, the model decomposes the task into executable programs that invoke specialized vision modules for precise temporal and semantic grounding. To further improve robustness, VideoPro incorporates a self-refinement mechanism that leverages execution feedback and confidence signals to correct erroneous executions and refine low-confidence reasoning programs. By tightly integrating adaptive reasoning with self-refinement, VideoPro consistently outperforms prior methods across multiple long-video understanding benchmarks, yielding an average 6.7% improvement for Qwen3-VL-8B.
Beyond Experience Retrieval: Learning to Generate Utility-Optimized Structured Experience for Frozen LLMs
Xuancheng Li | Haitao Li | Yujia Zhou | Yiqun Liu | Qingyao Ai
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Xuancheng Li | Haitao Li | Yujia Zhou | Yiqun Liu | Qingyao Ai
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) are largely static and often redo reasoning or repeat mistakes. Prior experience reuse typically relies on external retrieval, which is similarity-based, can introduce noise, and adds latency. We introduce SEAM (Structured Experience Adapter Module), a lightweight, executor-specific plug-in that stores experience in its parameters and generates a structured, instance-tailored experience entry in a single forward pass to guide a frozen LLM executor. SEAM is trained for utility via executor rollouts and GRPO while keeping the executor frozen, and can be further improved with logged-success SFT after deployment. Experiments on mathematical reasoning benchmarks show consistent accuracy gains across executors with low overhead. Extensive ablation and analysis further elucidate the mechanisms underlying SEAM’s effectiveness and robustness.[We release our code at <https://anonymous.4open.science/r/SEAM>.]
2025
SelfRACG: Enabling LLMs to Self-Express and Retrieve for Code Generation
Qian Dong | Jia Chen | Qingyao Ai | Hongning Wang | Haitao Li | Yiwu | Yao Hu | Yiqun Liu | Shaoping Ma
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Qian Dong | Jia Chen | Qingyao Ai | Hongning Wang | Haitao Li | Yiwu | Yao Hu | Yiqun Liu | Shaoping Ma
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Existing retrieval-augmented code generation (RACG) methods typically use an external retrieval module to fetch semantically similar code snippets used for generating subsequent fragments. However, even for consecutive code fragments, the content often diverges due to logical progression, resulting in a content gap. This gap undermines the performance of current RACG methods, as external retrieval modules based on content matching fail to infer the specific information need of LLMs to generate the next code fragment. Therefore, we propose SelfRACG, a novel paradigm that enables large language models (LLMs) to Self-express their information needs to enhance RACG. Specifically, SelfRACG includes an information need expression module and a two-stage information need-guided training strategy, which encourages LLMs to express their information need. Extensive experiments demonstrate that SelfRACG can retrieve external knowledge that better aligns with the LLM’s own information needs, resulting in superior generation performance compared to vanilla RACG. Moreover, both the training and deployment costs for retrieval in our framework are much lower than those of the strongest retrieval model.
CalibraEval: Calibrating Prediction Distribution to Mitigate Selection Bias in LLMs-as-Judges
Haitao Li | Junjie Chen | Qingyao Ai | Zhumin Chu | Yujia Zhou | Qian Dong | Yiqun Liu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Haitao Li | Junjie Chen | Qingyao Ai | Zhumin Chu | Yujia Zhou | Qian Dong | Yiqun Liu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The use of large language models (LLMs) as automated evaluation tools to assess the quality of generated natural language, known as ”LLMs-as-Judges”, has demonstrated promising capabilities and is rapidly gaining widespread attention. However, when applied to pairwise comparisons of candidate responses, LLM-based evaluators often exhibit selection bias. Specifically, their judgments may become inconsistent when the option positions or ID tokens are swapped, compromising the effectiveness and fairness of the evaluation result. To address this challenge, we introduce CalibraEval, a novel label-free method for mitigating selection bias during inference. Specifically, CalibraEval reformulates debiasing as an optimization task aimed at adjusting observed prediction distributions to align with unbiased prediction distributions. To solve this optimization problem, we propose a non-parametric order-preserving algorithm (NOA). This algorithm leverages the partial order relationships between model prediction distributions, thereby eliminating the need for explicit labels and precise mathematical function modeling. Empirical evaluations of LLMs in multiple representative benchmarks demonstrate that CalibraEval effectively mitigates selection bias and improves performance compared to existing debiasing methods. This work marks a step toward building more robust and unbiased automated evaluation frameworks, paving the way for improved reliability in AI-driven assessments. The code can be found at https://github.com/CSHaitao/CalibraEval.
LegalAgentBench: Evaluating LLM Agents in Legal Domain
Haitao Li | Junjie Chen | Jingli Yang | Qingyao Ai | Wei Jia | Youfeng Liu | Kai Lin | Yueyue Wu | Guozhi Yuan | Yiran Hu | Wuyue Wang | Yiqun Liu | Minlie Huang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Haitao Li | Junjie Chen | Jingli Yang | Qingyao Ai | Wei Jia | Youfeng Liu | Kai Lin | Yueyue Wu | Guozhi Yuan | Yiran Hu | Wuyue Wang | Yiqun Liu | Minlie Huang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
With the increasing intelligence and autonomy of LLM Agents, their potential applications in the legal domain are becoming increasingly apparent. However, existing general-domain benchmarks are unable to fully capture the complexity and subtle nuances inherent in real-world judicial cognition and decision-making. Therefore, we propose LegalAgentBench, a comprehensive benchmark specifically designed to evaluate LLM Agents in the Chinese legal domain. LegalAgentBench includes 17 corpora from real-world legal scenarios and provides 37 tools for interacting with external knowledge. To cover tasks of varying difficulty and types, we designed a scalable task construction process that enables a more precise evaluation of performance in both tool utilization and reasoning. Moreover, Beyond assessing performance through the success rate of final outcomes, LegalAgentBench incorporates keyword analysis during intermediate processes to calculate progress rates, facilitating a more fine-grained evaluation. We evaluated eight popular LLMs, highlighting the strengths, limitations, and potential areas for improvement of existing models and methods. LegalAgentBench sets a new benchmark for the practical application of LLMs in the legal domain, with its code and data available at https://github.com/CSHaitao/LegalAgentBench.
Search
Fix author
Co-authors
- Qingyao Ai 4
- Yiqun Liu 4
- Junjie Chen 2
- Qian Dong 2
- Yujia Zhou 2
- Jia Chen 1
- Qianglong Chen 1
- Zhumin Chu 1
- Shuai Dong 1
- Yiran HU 1
- Feng Han 1
- Haowen Hou 1
- Yao Hu 1
- Minlie Huang 1
- Wei Jia 1
- Chenglin Li 1
- Ruilin Li 1
- Xuancheng Li 1
- Kai Lin 1
- Youfeng Liu 1
- Shaoping Ma 1
- Feng Tao 1
- Jingqi Tong 1
- Hongning Wang 1
- Jiaqi Wang 1
- Wuyue Wang 1
- Yikun Wang 1
- Yueyue Wu 1
- Jingli Yang 1
- Yiwu 1
- Guozhi Yuan 1
- Yin Zhang 1