Bingxiang He
2026
Current Agents Fail to Leverage World Model as Tool for Foresight
Cheng Qian | Emre Can Acikgoz | Bingxuan Li | Xiusi Chen | Yuji Zhang | Bingxiang He | Qinyu Luo | Gokhan Tur | Dilek Hakkani-Tür | Yunzhu Li | Heng Ji
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Agents built on vision-language models increasingly face tasks that demand anticipating future states rather than relying on short-horizon reasoning. Generative world models offer a promising remedy: agents could use them as external simulators to foresee outcomes before acting. This paper empirically examines whether current agents can leverage such world models as tools to enhance their cognition. Across diverse agentic and visual question answering tasks, we observe that some agents rarely invoke simulation (fewer than 1%), frequently misuse predicted rollouts (approximately 15%), and often exhibit inconsistent or even degraded performance (up to 5%) when simulation is available or enforced. Attribution analysis further indicates that the primary bottleneck lies in the agents’ capacity to decide when to simulate, how to interpret predicted outcomes, and how to integrate foresight into downstream reasoning. These findings underscore the need for mechanisms that foster calibrated, strategic interaction with world models, paving the way toward more reliable anticipatory cognition in future agent systems.
CostBench: Evaluating Multi-Turn Cost-Optimal Planning and Adaptation in Dynamic Environments for LLM Tool-Use Agents
Jiayu Liu | Cheng Qian | Zhaochen Su | Qing Zong | Shijue Huang | Bingxiang He | Yi R. Fung
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Current evaluations of Large Language Model (LLM) agents primarily emphasize task completion, often overlooking resource efficiency and adaptability. This neglects a crucial capability: agents’ ability to devise and adjust cost-optimal plans in response to changing environments. To bridge this gap, we introduce CostBench, a scalable, cost-centric benchmark designed to evaluate agents’ economic reasoning and replanning abilities. Situated in the travel-planning domain, CostBench comprises tasks solvable via multiple sequences of atomic and composite tools with diverse, customizable costs. It also supports four types of dynamic blocking events, such as tool failures and cost changes, to simulate real-world unpredictability and require agents to adapt in real time. Evaluating leading open-source and proprietary models on CostBench reveals a substantial gap in cost-aware planning: agents frequently fail to identify cost-optimal solutions in static settings, with even GPT-5 achieving an exact match rate below 75% on the hardest tasks, and performance drops significantly further under dynamic conditions. By diagnosing these weaknesses, CostBench lays the groundwork for developing future agents that are both economically rational and robust.
Veri-R1: Toward Precise and Faithful Claim Verification via Online Reinforcement Learning
Qi He | Cheng Qian | Xiusi Chen | Bingxiang He | Yi R. Fung | Heng Ji
Findings of the Association for Computational Linguistics: ACL 2026
Claim verification with large language models (LLMs) has recently attracted growing attention, due to their strong reasoning capabilities and transparent verification processes compared to traditional answer-only judgments. However, existing approaches to online claim verification, which requires iterative evidence retrieval and reasoning, still mainly rely on prompt engineering or pre-designed reasoning workflows, without unified training to improve necessary skills. Therefore, we introduce Veri-R1, an online reinforcement learning (RL) framework that enables an LLM to interact with a search engine and to receive reward signals that explicitly shape its planning, retrieval, and reasoning behaviors. The dynamic interaction between models and retrieval systems more accurately reflects real-world verification scenarios and fosters comprehensive verification skills. Empirical results show that Veri-R1 improves joint accuracy by up to 30% and doubles evidence score, often surpassing its larger-scale model counterparts. Ablation studies further reveal the impact of reward components, and the link between output logits and label accuracy. Our results highlight the effectiveness of online RL for precise and faithful claim verification, and provide a foundation for future research.
2025
EscapeBench: Towards Advancing Creative Intelligence of Language Model Agents
Cheng Qian | Peixuan Han | Qinyu Luo | Bingxiang He | Xiusi Chen | Yuji Zhang | Hongyi Du | Jiarui Yao | Xiaocheng Yang | Denghui Zhang | Yunzhu Li | Heng Ji
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Language model agents excel in long-session planning and reasoning, but existing benchmarks primarily focus on goal-oriented tasks with explicit objectives, neglecting creative adaptation in unfamiliar environments. To address this, we introduce EscapeBench, a benchmark suite of room escape game environments designed to challenge agents with creative reasoning, unconventional tool use, and iterative problem-solving to uncover implicit goals. Our results show that current language models, despite employing working memory and Chain-of-Thought reasoning, achieve only 15% average progress without hints, highlighting their limitations in creativity. To bridge this gap, we propose EscapeAgent, a framework designed to enhance creative reasoning through Foresight (innovative tool use) and Reflection (identifying unsolved tasks). Experiments show that EscapeAgent can execute action chains over 1,000 steps while maintaining logical coherence. It navigates and completes games with up to 40% fewer steps and hints, performs robustly across difficulty levels, and achieves higher action success rates with more efficient and innovative puzzle-solving strategies.
The Right Time Matters: Data Arrangement Affects Zero-Shot Generalization in Instruction Tuning
Bingxiang He | Ning Ding | Cheng Qian | Jia Deng | Ganqu Cui | Lifan Yuan | Haiwen Hong | Huan-ang Gao | Longtao Huang | Hui Xue | Huimin Chen | Zhiyuan Liu | Maosong Sun
Findings of the Association for Computational Linguistics: ACL 2025
Understanding alignment techniques begins with comprehending zero-shot generalization brought by instruction tuning, but little of the mechanism has been understood. Existing work has largely been confined to the task level, without considering that tasks are artificially defined and, to LLMs, merely consist of tokens and representations. To bridge this gap, we investigate zero-shot generalization from the perspective of the data itself. We first demonstrate that zero-shot generalization happens very early during instruction tuning, with loss serving as a stable indicator. Next, we investigate training data arrangement through similarity and granularity perspectives, confirming that the timing of exposure to certain training examples may greatly facilitate generalization on unseen tasks. Finally, we propose a more grounded training data arrangement framework, Test-centric Multi-turn Arrangement, and show its effectiveness in promoting continual learning and further loss reduction. For the first time, we show that zero-shot generalization during instruction tuning is a form of similarity-based generalization between training and test data at the instance level.
2024
Tell Me More! Towards Implicit User Intention Understanding of Language Model Driven Agents
Cheng Qian | Bingxiang He | Zhong Zhuang | Jia Deng | Yujia Qin | Xin Cong | Zhong Zhang | Jie Zhou | Yankai Lin | Zhiyuan Liu | Maosong Sun
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Current language model-driven agents often lack mechanisms for effective user participation, which is crucial given the vagueness commonly found in user instructions. Although adept at devising strategies and performing tasks, these agents struggle with seeking clarification and grasping precise user intentions. To bridge this gap, we introduce Intention-in-Interaction (IN3), a novel benchmark designed to inspect users’ implicit intentions through explicit queries. Next, we propose the incorporation of model experts as the upstream in agent designs to enhance user-agent interaction. Employing IN3, we empirically train Mistral-Interact, a powerful model that proactively assesses task vagueness, inquires about user intentions, and refines them into actionable goals before starting downstream agent task execution. Integrating it into the XAgent framework, we comprehensively evaluate the enhanced agent system regarding user instruction understanding and execution, revealing that our approach notably excels at identifying vague user tasks, recovering and summarizing critical missing information, setting precise and necessary agent execution goals, and minimizing redundant tool usage, thus boosting overall efficiency.
2023
Beat LLMs at Their Own Game: Zero-Shot LLM-Generated Text Detection via Querying ChatGPT
Biru Zhu | Lifan Yuan | Ganqu Cui | Yangyi Chen | Chong Fu | Bingxiang He | Yangdong Deng | Zhiyuan Liu | Maosong Sun | Ming Gu
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Large language models (LLMs), e.g., ChatGPT, have revolutionized the domain of natural language processing because of their excellent performance on various tasks. Despite their great potential, LLMs also raise serious concerns because they are likely to be misused. There are already reported cases of academic cheating using LLMs. Thus, identifying LLM-generated texts is a pressing problem. In this work, we design a zero-shot black-box method for detecting LLM-generated texts. The key idea is to revise the text to be detected using the ChatGPT model. Our method is based on the intuition that the ChatGPT model will make fewer revisions to LLM-generated texts than it does to human-written texts, because the texts generated by LLMs are more in accord with the generation logic and statistical patterns learned by LLMs like ChatGPT. Thus, if the text to be detected and its ChatGPT-revised version have a higher degree of similarity, the text is more likely to be LLM-generated. Extensive experiments on various datasets and tasks show that our method can effectively detect LLM-generated texts. Moreover, compared with other detection methods, our method has better generalization ability and is more stable across various datasets. The code is publicly available at https://github.com/thunlp/LLM-generated-text-detection.
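The revise-then-compare idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `revise` callable (a stand-in for a call to a revising model), the character-level `difflib` similarity, and the 0.9 threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import Callable


def revision_similarity(text: str, revise: Callable[[str], str]) -> float:
    """Similarity in [0, 1] between a text and its revised version.

    `revise` is a hypothetical stand-in for asking a model to revise the text.
    The character-level difflib ratio is an assumed similarity measure.
    """
    revised = revise(text)
    return SequenceMatcher(None, text, revised).ratio()


def looks_llm_generated(
    text: str,
    revise: Callable[[str], str],
    threshold: float = 0.9,  # hypothetical cutoff, would need tuning on data
) -> bool:
    """Flag the text when the revision changed little (high similarity)."""
    return revision_similarity(text, revise) >= threshold
```

In use, `revise` would wrap a request to the revising model; a text that comes back nearly unchanged scores close to 1.0 and is flagged, while a heavily rewritten text scores lower and is not.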
Co-authors
- Cheng Qian 6
- Xiusi Chen 3
- Heng Ji 3
- Maosong Sun (孙茂松) 3
- Ganqu Cui 2
- Jia Deng 2
- Yi R. Fung 2
- Yunzhu Li 2
- Zhiyuan Liu 2
- Qinyu Luo 2
- Lifan Yuan 2
- Yuji Zhang 2
- Emre Can Acikgoz 1
- Yangyi Chen 1
- Huimin Chen 1
- Xin Cong 1
- Yangdong Deng 1
- Ning Ding 1
- Hongyi Du 1
- Chong Fu 1
- Huan-ang Gao 1
- Ming Gu 1
- Dilek Hakkani-Tür 1
- Peixuan Han 1
- Qi He 1
- Haiwen Hong 1
- Longtao Huang 1
- Shijue Huang 1
- Bingxuan Li 1
- Yankai Lin (林衍凯) 1
- Zhiyuan Liu 1
- Jiayu Liu 1
- Yujia Qin 1
- Zhaochen Su 1
- Gokhan Tur 1
- Hui Xue 1
- Xiaocheng Yang 1
- Jiarui Yao 1
- Denghui Zhang 1
- Zhong Zhang 1
- Jie Zhou 1
- Biru Zhu 1
- Zhong Zhuang 1
- Qing Zong 1