Xander Xu
2026
V-GameGym: Visual Game Generation for Code Large Language Models
Wei Zhang | Jian Yang | Renshuai Tao | Linzheng Chai | Shuyue Guo | Jiajun Wu | Xiaoming Chen | Ganqu Cui | Ning Ding | Xander Xu | HU Wei | Bowen Zhou
Findings of the Association for Computational Linguistics: ACL 2026
Wei Zhang | Jian Yang | Renshuai Tao | Linzheng Chai | Shuyue Guo | Jiajun Wu | Xiaoming Chen | Ganqu Cui | Ning Ding | Xander Xu | HU Wei | Bowen Zhou
Findings of the Association for Computational Linguistics: ACL 2026
Code large language models have demonstrated remarkable capabilities in programming tasks, yet current benchmarks primarily focus on single modality rather than visual game development. Most existing code-related benchmarks evaluate syntax correctness and execution accuracy, overlooking critical game-specific metrics such as playability, visual aesthetics, and user engagement that are essential for real-world deployment. To address the gap between current LLM capabilities in algorithmic problem-solving and competitive programming versus the comprehensive requirements of practical game development, we present V-GameGym, a comprehensive benchmark comprising 2,219 high-quality samples across 100 thematic clusters derived from real-world repositories, adopting a novel clustering-based curation methodology to ensure both diversity and structural completeness. Further, we introduce a multimodal evaluation framework with an automated LLM-driven pipeline for visual code synthesis using complete UI sandbox environments. Our extensive analysis reveals that V-GameGym effectively bridges the gap between code generation accuracy and practical game development workflows, providing quantifiable quality metrics for visual programming and interactive element generation.
Beyond Quantity: Trajectory Diversity Scaling for Code Agents
Guhong Chen | Chenghao Sun | Cheng Fu | Qiyao Wang | Zhihong Huang | ChaoPeng Wei | Guangxu Chen | Feiteng Fang | Ahmadreza Argha | Bing Zhao | Xander Xu | Qi Han | Hamid Alinejad-Rokny | Qiang Qu | Binhua Li | Shiwen Ni | Min Yang | HU Wei | Yongbin Li
Findings of the Association for Computational Linguistics: ACL 2026
Guhong Chen | Chenghao Sun | Cheng Fu | Qiyao Wang | Zhihong Huang | ChaoPeng Wei | Guangxu Chen | Feiteng Fang | Ahmadreza Argha | Bing Zhao | Xander Xu | Qi Han | Hamid Alinejad-Rokny | Qiang Qu | Binhua Li | Shiwen Ni | Min Yang | HU Wei | Yongbin Li
Findings of the Association for Computational Linguistics: ACL 2026
As code large language models (LLMs) evolve into tool-interactive agents via the Model Context Protocol (MCP), their generalization is increasingly limited by low-quality synthetic data and the diminishing returns of quantity scaling; moreover, quantity-centric scaling exhibits an early bottleneck that underutilizes trajectory data. We propose TDScaling, a Trajectory Diversity Scaling-based data synthesis framework for code agents that scales performance through diversity rather than raw volume. Moreover, TDScaling is more data-efficient: under a fixed training budget, increasing trajectory diversity yields larger gains than adding more trajectories, improving the performance-cost trade-off for agent training. TDScaling integrates four innovations: (1) a Business Cluster mechanism that captures real-service logical dependencies; (2) a Blueprint-driven multi-agent paradigm that enforces trajectory coherence; (3) an adaptive evolution mechanism that steers synthesis toward long-tail scenarios using Domain Entropy, Reasoning Mode Entropy, and Cumulative Action Complexity to prevent mode collapse; and (4) a sandboxed code tool that mitigates catastrophic forgetting of intrinsic coding capabilities. Experiments on general tool-use benchmarks (BFCL, 𝜏2-Bench) and code agent tasks (RebenchT, CodeCI, BIRD) demonstrate a win-win outcome: TDScaling improves both tool-use generalization and inherent coding proficiency. Crucially, we show that trajectory diversity scaling attains a substantially higher performance ceiling than quantity scaling, establishing a resource-efficient paradigm for training robust code agents under data bottlenecks.