Jianshu Zhang
2026
ProgressLM: Towards Progress Reasoning in Vision-Language Models
Jianshu Zhang | Chengxuan Qian | Haosen Sun | Haoran Lu | Dingcheng Wang | Letian Xue | Han Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Estimating task progress requires long-horizon and dynamic reasoning, going beyond static visual perception. Although Vision-Language Models (VLMs) excel at describing what is visible in a single observation, it remains unclear whether they can infer how far a task has progressed from partial information. To study this question, we introduce Progress-Bench, a benchmark with over 3K instances for evaluating progress reasoning from a single observation. We further examine a human-inspired two-stage paradigm that combines episodic retrieval with mental simulation. We instantiate this paradigm through both training-free prompting and a training-based approach using the automatically curated ProgressLM-45K dataset. Experiments on 14 VLMs show that most models struggle with reliable progress estimation, and that training-free reasoning provides only limited and model-dependent benefits. In contrast, the training-based ProgressLM-3B achieves consistent improvements in accuracy, robustness to viewpoint variation, and handling of unanswerable cases, despite its small scale. Additional analyses reveal common failure patterns in existing VLMs and clarify when and why progress reasoning succeeds or fails.
WebAggregator: Enhancing Compositional Reasoning Capabilities of Deep Research Agent Foundation Models
Rui Wang | Ce Zhang | Jun-Yu Ma | Jianshu Zhang | Hongru Wang | Yi Chen | Boyang Xue | Tianqing Fang | Zhisong Zhang | Hongming Zhang | Haitao Mi | Dong Yu | Kam-Fai Wong
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The hallmark of Deep Research agents lies in compositional reasoning, the capacity to aggregate distributed, heterogeneous information into coherent logical insights. However, current agentic systems are often retrieval-heavy but reasoning-light, where success is predominantly determined by simple entity-seeking rather than the multi-step aggregation of scattered evidence. To address this, we propose WebAggregator, a data synthesis pipeline designed to shift the agentic paradigm from retrieval-centric to compositional aggregation. Our approach first employs a Proactive Explorer to collect interconnected knowledge, then a Compositional Logic Proposer to weave that knowledge into complex questions using over 12 composition guidelines derived from a rigorous deconstruction of the Deep Research problem setting. Fine-tuning on this corpus fundamentally transforms agent behavior, fostering deliberate compositional reasoning and reduced tool redundancy. The resulting WebAggregator-32B surpasses GPT-4.1 and matches Claude-3.7-Sonnet on GAIA, WebWalkerQA, and XBench. To address the lack of benchmarks that emphasize both reasoning and retrieval, we introduce the WebAggregatorQA testbed, which reveals that even with perfect retrieval, top-tier models still underperform. These results demonstrate that compositional reasoning, not retrieval, is the true performance ceiling for next-generation research agents.
2025
WebCoT: Enhancing Web Agent Reasoning by Reconstructing Chain-of-Thought in Reflection, Branching, and Rollback
Minda Hu | Tianqing Fang | Jianshu Zhang | Jun-Yu Ma | Zhisong Zhang | Jingyan Zhou | Hongming Zhang | Haitao Mi | Dong Yu | Irwin King
Findings of the Association for Computational Linguistics: EMNLP 2025
Web agents powered by Large Language Models (LLMs) show promise for next-generation AI, but their limited reasoning in uncertain, dynamic web environments hinders robust deployment. In this paper, we identify key reasoning skills essential for effective web agents, i.e., reflection & lookahead, branching, and rollback, and curate trajectory data that exemplifies these abilities by reconstructing the agent’s (inference-time) reasoning algorithms into chain-of-thought rationales. We conduct experiments in the agent self-improving benchmark, OpenWebVoyager, and demonstrate that distilling salient reasoning patterns into the backbone LLM via simple fine-tuning can substantially enhance its performance. Our approach yields significant improvements across multiple benchmarks, including WebVoyager, Mind2web-live, and SimpleQA (web search), highlighting the potential of targeted reasoning skill enhancement for web agents.
Bridge-Coder: Transferring Model Capabilities from High-Resource to Low-Resource Programming Language
Jipeng Zhang | Jianshu Zhang | Yuanzhe Li | Renjie Pi | Rui Pan | Runtao Liu | Zheng Ziqiang | Tong Zhang
Findings of the Association for Computational Linguistics: ACL 2025
Most LLMs excel at generating code for high-resource programming languages (HRPLs) like Python, a capability that has become standard due to the abundance of training data. However, they struggle significantly with low-resource programming languages (LRPLs) such as D, exacerbating the digital divide. This gap prevents developers who use LRPLs from benefiting equally and hinders innovation within underrepresented programming communities. To make matters worse, manually generating data for LRPLs is highly labor-intensive and requires expensive expert effort. In this work, we begin by analyzing the NL-PL Gap, where LRPL data generated directly by LLMs often suffers from subpar quality due to the misalignment between natural language (NL) instructions and programming language (PL) outputs. To address this issue, we introduce Bridge-Assist Generation, a method to generate LRPL data utilizing LLMs’ general knowledge, HRPL proficiency, and in-context learning capabilities. To further maximize the utility of the generated data, we propose Bridged Alignment to obtain Bridge-Coder. To thoroughly evaluate our approach, we select four relatively low-resource programming languages: R, D, Racket, and Bash. Experimental results reveal that Bridge-Coder achieves significant improvements over the original model, with average gains of 18.71 and 10.81 on two comprehensive benchmarks, M-HumanEval and M-MBPP.
VLM2-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues
Jianshu Zhang | Dongyu Yao | Renjie Pi | Paul Pu Liang | Yi R. Fung
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Visually linking matching cues is a crucial ability in daily life, such as identifying the same person in multiple photos based on their cues, even without knowing who they are. Despite the extensive knowledge that vision-language models (VLMs) possess, it remains largely unexplored whether they are capable of performing this fundamental task. To address this, we introduce VLM2-Bench, a benchmark designed to assess whether VLMs can Visually Link Matching cues, with 9 subtasks and over 3,000 test cases. Comprehensive evaluation across twelve VLMs, along with further analysis of various language-side and vision-side prompting methods, leads to a total of eight key findings. We identify critical challenges in models’ ability to link visual cues, highlighting a significant performance gap. Based on these insights, we advocate for (i) enhancing core visual capabilities to improve adaptability and reduce reliance on prior knowledge, (ii) establishing clearer principles for integrating language-based reasoning in vision-centric tasks to prevent unnecessary biases, and (iii) shifting vision-text training paradigms toward fostering models’ ability to independently structure and infer relationships among visual cues.
Co-authors
- Tianqing Fang 2
- Jun-Yu Ma 2
- Haitao Mi 2
- Renjie Pi 2
- Dong Yu (于东) 2
- Hongming Zhang 2
- Zhisong Zhang 2
- Yi Chen 1
- Yi R. Fung 1
- Minda Hu 1
- Irwin King 1
- Yuanzhe Li 1
- Paul Pu Liang 1
- Han Liu 1
- Runtao Liu 1
- Haoran Lu 1
- Rui Pan 1
- Chengxuan Qian 1
- Haosen Sun 1
- Dingcheng Wang 1
- Hongru Wang 1
- Rui Wang 1
- Kam-Fai Wong 1
- Boyang Xue 1
- Letian Xue 1
- Dongyu Yao 1
- Ce Zhang 1
- Jipeng Zhang 1
- Tong Zhang 1
- Jingyan Zhou 1
- Zheng Ziqiang 1