Jianhong Tu
2026
ToolRM: Towards Agentic Tool-Use Reward Modeling
Renhao Li | Jianhong Tu | Yang Su | Yantao Liu | Fei Huang | Hamid Alinejad-Rokny | Derek F. Wong | Junyang Lin | Min Yang
Findings of the Association for Computational Linguistics: ACL 2026
Reward models (RMs) play a critical role in aligning large language models (LLMs) with human preferences. Yet in the domain of tool learning, the lack of RMs specifically designed for function-calling tasks has limited progress toward more capable agentic AI. We introduce ToolRM, a family of lightweight reward models tailored for general tool-use scenarios. To build these models, we propose a novel pipeline that constructs high-quality pairwise preference data using rule-based scoring and multidimensional sampling. This yields ToolPref-Pairwise-30K, a diverse, balanced, and challenging preference dataset that supports both generative and discriminative reward modeling. We also introduce TRBenchBFCL, a benchmark built on the agent evaluation suite BFCL to evaluate RMs on tool calling tasks. Trained on our constructed data, models from the Qwen3-4B/8B series achieve up to 17.94% higher accuracy, substantially outperforming frontier LLMs and RMs in pairwise reward judgments. Beyond training objectives, generative ToolRM generalizes to broader critique tasks, including Best-of-N sampling and self-correction. Experiments on ACEBench highlight its effectiveness and efficiency, enabling inference-time scaling while reducing output token usage by over 66%. Its support for downstream RL training further validates its practical utility. We release data to facilitate future research.
DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints
Yinger Zhang | Shutong Jiang | Renhao Li | Jianhong Tu | Yang Su | Lianghao Deng | Xudong Guo | ChenXu Lv | Junyang Lin
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
While agent evaluation has shifted toward long-horizon tasks, most benchmarks still emphasize local, step-level reasoning rather than the global constrained optimization (e.g., time and financial budgets) that demands genuine planning ability. Meanwhile, existing LLM planning benchmarks underrepresent the active information gathering and fine-grained local constraints typical of real-world settings. To address this, we introduce DeepPlanning, a challenging benchmark for practical long-horizon agent planning. It features multi-day travel planning and multi-product shopping tasks that require proactive information acquisition, local constrained reasoning, and global constrained optimization. Evaluations on DeepPlanning show that even frontier agentic LLMs struggle with these problems, highlighting the importance of reliable explicit reasoning patterns and parallel tool use for achieving better effectiveness-efficiency trade-offs. Error analysis further points to promising directions for improving agentic LLMs over long planning horizons. We open-source the code and data to support future research.
Fico: Evaluating Vision-Language Models under Visual Fidelity and Compression at Scale
Jianhong Tu | Nicholas Crispino | Kyle Montgomery | Chenguang Wang | Dawn Song
Findings of the Association for Computational Linguistics: ACL 2026
Visual text compression is an emerging paradigm for rendering text as images for processing by vision-language models (VLMs), enabling higher information density per context token. However, the robustness of VLMs under dense, text-based visual inputs remains unevaluated. We introduce Fico, a benchmark designed to assess VLM robustness across seven controlled variants of visual fidelity and information density. Fico spans documents of 8k to 64k tokens and includes three tasks of increasing semantic granularity: optical character recognition (OCR), needle-in-a-haystack (NIAH) retrieval, and visual question answering (VQA). Evaluating 13 general-purpose VLMs and 3 OCR-specialized models reveals three consistent trends: performance drops sharply under increased density or reduced resolution; cross-task transfer between OCR, NIAH, and VQA is limited; and VQA is comparatively robust because low-level details are lost before high-level semantics. By exposing failure modes that remain invisible under conventional VLM evaluations, Fico establishes a rigorous test-bed for visual text compression.
2025
MLAN: Language-Based Instruction Tuning Preserves and Transfers Knowledge in Multimodal Language Models
Jianhong Tu | Zhuohao Ni | Nicholas Crispino | Zihao Yu | Michael Bendersky | Beliz Gunel | Ruoxi Jia | Xin Liu | Lingjuan Lyu | Dawn Song | Chenguang Wang
Proceedings of the 3rd Workshop on Towards Knowledgeable Foundation Models (KnowFM)
We present a novel visual instruction tuning strategy to improve the zero-shot task generalization of multimodal large language models by building a firm text-only knowledge base. Existing work lacks sufficient experimentation on the importance of each modality in the instruction tuning stage, often using a majority of vision-language data while keeping text-only data limited and fixing mixtures of modalities. By incorporating diverse text-only data in the visual instruction tuning stage, we vary vision-language data in various controlled experiments to investigate the importance of modality in visual instruction tuning. Our comprehensive evaluation shows that the text-heavy instruction tuning approach performs on par with traditional vision-heavy mixtures on both modalities across 12 general datasets while using as few as half the total training tokens. We find that simply increasing sufficiently diverse text-only data enables transfer of instruction following ability and domain knowledge across modalities while being more efficient than the vision-language approach.
LongWeave: A Long-Form Generation Benchmark Bridging Real-World Relevance and Verifiability
Zikai Xiao | Fei Huang | Jianhong Tu | Jianhui Wei | Wen Ma | Yuxuan Zhou | Jian Wu | Bowen Yu | Zuozhu Liu | Junyang Lin
Findings of the Association for Computational Linguistics: EMNLP 2025
Generating long, informative, and factual outputs remains a major challenge for Large Language Models (LLMs). Existing benchmarks for long-form generation typically assess real-world queries with hard-to-verify metrics or use synthetic setups that ease evaluation but overlook real-world intricacies. In this paper, we introduce LongWeave, which balances real-world and verifiable assessment with Target-Anchored Evaluation (TAE). TAE constructs tasks by first defining verifiable targets within real-world scenarios, then systematically generating corresponding queries, textual materials, and anchors based on these targets. This ensures that tasks are both realistic and objectively assessable, enabling rigorous assessment of model capabilities in meeting complex real-world constraints. LongWeave supports customizable input/output lengths (up to 64K/8K tokens) across seven distinct tasks. Evaluation on 23 LLMs shows that even state-of-the-art models encounter significant challenges in long-form generation as real-world complexity and output length increase. The dataset will be publicly available.
Co-authors
- Junyang Lin 3
- Nicholas Crispino 2
- Renhao Li 2
- Dawn Song 2
- Yang Su 2
- Chenguang Wang 2
- Hamid Alinejad-Rokny 1
- Michael Bendersky 1
- Lianghao Deng 1
- Beliz Gunel 1
- Xudong Guo 1
- Fei Huang 1
- Fei Huang 1
- Ruoxi Jia 1
- Shutong Jiang 1
- Yantao Liu 1
- Xin Liu 1
- Zuozhu Liu 1
- Chenxu Lv 1
- Lingjuan Lyu 1
- Wen Ma 1
- Kyle Montgomery 1
- Zhuohao Ni 1
- Jianhui Wei 1
- Derek F. Wong (黄辉) 1
- Jian Wu 1
- Zikai Xiao 1
- Min Yang 1
- Zihao Yu 1
- Bowen Yu 1
- Yinger Zhang 1
- Yuxuan Zhou 1