Qiyao Wang
2026
Beyond Quantity: Trajectory Diversity Scaling for Code Agents
Guhong Chen | Chenghao Sun | Cheng Fu | Qiyao Wang | Zhihong Huang | ChaoPeng Wei | Guangxu Chen | Feiteng Fang | Ahmadreza Argha | Bing Zhao | Xander Xu | Qi Han | Hamid Alinejad-Rokny | Qiang Qu | Binhua Li | Shiwen Ni | Min Yang | HU Wei | Yongbin Li
Findings of the Association for Computational Linguistics: ACL 2026
As code large language models (LLMs) evolve into tool-interactive agents via the Model Context Protocol (MCP), their generalization is increasingly limited by low-quality synthetic data and the diminishing returns of quantity scaling; moreover, quantity-centric scaling exhibits an early bottleneck that underutilizes trajectory data. We propose TDScaling, a Trajectory Diversity Scaling-based data synthesis framework for code agents that scales performance through diversity rather than raw volume. TDScaling is also more data-efficient: under a fixed training budget, increasing trajectory diversity yields larger gains than adding more trajectories, improving the performance-cost trade-off for agent training. TDScaling integrates four innovations: (1) a Business Cluster mechanism that captures real-service logical dependencies; (2) a Blueprint-driven multi-agent paradigm that enforces trajectory coherence; (3) an adaptive evolution mechanism that steers synthesis toward long-tail scenarios using Domain Entropy, Reasoning Mode Entropy, and Cumulative Action Complexity to prevent mode collapse; and (4) a sandboxed code tool that mitigates catastrophic forgetting of intrinsic coding capabilities. Experiments on general tool-use benchmarks (BFCL, τ²-Bench) and code agent tasks (RebenchT, CodeCI, BIRD) demonstrate a win-win outcome: TDScaling improves both tool-use generalization and inherent coding proficiency. Crucially, we show that trajectory diversity scaling attains a substantially higher performance ceiling than quantity scaling, establishing a resource-efficient paradigm for training robust code agents under data bottlenecks.
Towards IP Intelligence: Benchmarking Large Language Models on Intellectual Property Knowledge and Practice
Qiyao Wang | Guhong Chen | Hongbo Wang | Huaren Liu | Minghui Zhu | Zhifei Qin | Li Linwei | Yilin Yue | Shiqiang Wang | Jiayan Li | Wu Yihang | Ziqiang Liu | Longze Chen | Run Luo | Liyang Fan | Jiaming Li | Lei Zhang | Kan Xu | Hamid Alinejad-Rokny | Chengming Li | Shiwen Ni | Yuan Lin | Min Yang
Findings of the Association for Computational Linguistics: ACL 2026
Intellectual Property (IP) is a highly specialized domain that integrates technical and legal knowledge, making it inherently complex and knowledge-intensive. Recent advancements in LLMs have demonstrated their potential to handle IP tasks, enabling more efficient analysis, understanding, and generation of IP-related content. However, existing datasets and benchmarks focus narrowly on patents or cover limited aspects of the IP field, lacking alignment with real-world scenarios. To bridge this gap, we introduce **IPBench**, the first comprehensive IP task taxonomy and a large-scale bilingual benchmark encompassing **8 IP mechanisms and 20 distinct tasks**, designed to evaluate LLMs in real-world IP practice. We benchmark **19 main LLMs**, ranging from general purpose to domain-specific, including chat-oriented and reasoning-focused models, under zero-shot, few-shot, and chain-of-thought settings. Our results show that even the top-performing model, DeepSeek-V3, achieves only 75.8% accuracy, indicating significant room for improvement. Notably, open-source IP and law-oriented models lag behind closed-source general-purpose models. To foster future research, we publicly release IPBench and will expand it with additional tasks to better reflect real-world complexities and support model advancements in the IP domain. We provide the data and code in the supplementary materials.
Co-authors
- Hamid Alinejad-Rokny 2
- Guhong Chen 2
- Shiwen Ni 2
- Min Yang 2
- Ahmadreza Argha 1
- Guangxu Chen 1
- Longze Chen 1
- Liyang Fan 1
- Feiteng Fang 1
- Cheng Fu 1
- Qi Han 1
- Zhihong Huang 1
- Binhua Li 1
- Yongbin Li 1
- Jiayan Li 1
- Jiaming Li 1
- Chengming Li 1
- Yuan Lin 1
- Li Linwei 1
- Huaren Liu 1
- Ziqiang Liu 1
- Run Luo 1
- Zhifei Qin 1
- Qiang Qu 1
- Chenghao Sun 1
- Hongbo Wang 1
- Shiqiang Wang 1
- ChaoPeng Wei 1
- HU Wei 1
- Xander Xu 1
- Kan Xu 1
- Wu Yihang 1
- Yilin Yue 1
- Lei Zhang 1
- Bing Zhao 1
- Minghui Zhu 1