HU Wei
2026
PLAWBENCH: A Rubric-Based Benchmark for Evaluating LLMs in Real-World Legal Practice
Yuzhen Shi | Huanghai Liu | Yiran HU | Song Gaojie | Xu Xinran | Yubo Ma | Tianyi Tang | Li Zhang | Qingjing Chen | Feng Di | Wenbo Lv | Weiheng Wu | Kexin Yang | Sen Yang | Wei Wang | Rongyao Shi | Qiu Yuanyang | Yuemeng Qi | Zhang Jingwen | Sui Xiaoyu | Yifan Chen | Zhang Yi | An Yang | Bowen Yu | Dayiheng Liu | Junyang Lin | Weixing Shen | Bing Zhao | Charles L. A. Clarke | HU Wei
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
As large language models (LLMs) are increasingly applied to legal domain-specific tasks, evaluating their ability to perform legal work in real-world settings has become essential. However, existing legal benchmarks rely on simplified and highly standardized tasks, failing to capture the ambiguity, complexity, and reasoning demands of real legal practice. Moreover, prior evaluations often adopt coarse, single-dimensional metrics and do not explicitly assess fine-grained legal reasoning. To address these limitations, we introduce PLawBench, a Practical Law Benchmark designed to evaluate LLMs in realistic legal practice scenarios. Grounded in real-world legal workflows, PLawBench models the core processes of legal practitioners through three task categories: public legal consultation, practical case analysis, and legal document generation. These tasks assess a model’s ability to identify legal issues and key facts, perform structured legal reasoning, and generate legally coherent documents. PLawBench comprises 850 questions across 13 practical legal scenarios, with each question accompanied by expert-designed evaluation rubrics, resulting in approximately 12,500 rubric items for fine-grained assessment. Using an LLM-based evaluator aligned with human expert judgments, we evaluate 10 state-of-the-art LLMs. Experimental results show that none achieves strong performance on PLawBench, revealing substantial limitations in the fine-grained legal reasoning capabilities of current LLMs and highlighting important directions for future evaluation and development of legal LLMs. Data is available at: https://anonymous.4open.science/r/PLawbench-B524/.
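As a rough illustration of how per-question rubric items could be turned into a single score, the sketch below computes a weighted fraction of satisfied items. The `RubricItem` fields, the weights, and the aggregation rule are all assumptions made for illustration; the abstract does not specify how PLawBench aggregates its rubric items.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RubricItem:
    # Hypothetical structure: the abstract does not describe the real rubric schema.
    description: str
    weight: float
    satisfied: bool  # e.g. the judgment returned by an LLM-based evaluator


def rubric_score(items: List[RubricItem]) -> float:
    """Return the weighted fraction of satisfied rubric items, in [0, 1]."""
    total = sum(item.weight for item in items)
    if total == 0:
        return 0.0  # no scorable items
    earned = sum(item.weight for item in items if item.satisfied)
    return earned / total


# Illustrative rubric for one question (contents are invented).
example_items = [
    RubricItem("Identifies the key legal issue", 2.0, True),
    RubricItem("Cites the applicable provision", 1.0, False),
    RubricItem("Reaches a coherent conclusion", 1.0, True),
]
example_score = rubric_score(example_items)
```

Averaging such per-question scores across a benchmark would give one model-level number, though other aggregation schemes are equally plausible.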
V-GameGym: Visual Game Generation for Code Large Language Models
Wei Zhang | Jian Yang | Renshuai Tao | Linzheng Chai | Shuyue Guo | Jiajun Wu | Xiaoming Chen | Ganqu Cui | Ning Ding | Xander Xu | HU Wei | Bowen Zhou
Findings of the Association for Computational Linguistics: ACL 2026
Code large language models have demonstrated remarkable capabilities in programming tasks, yet current benchmarks primarily focus on a single modality rather than visual game development. Most existing code-related benchmarks evaluate syntax correctness and execution accuracy, overlooking critical game-specific metrics such as playability, visual aesthetics, and user engagement that are essential for real-world deployment. To address the gap between current LLM capabilities in algorithmic problem-solving and competitive programming versus the comprehensive requirements of practical game development, we present V-GameGym, a comprehensive benchmark comprising 2,219 high-quality samples across 100 thematic clusters derived from real-world repositories, adopting a novel clustering-based curation methodology to ensure both diversity and structural completeness. Further, we introduce a multimodal evaluation framework with an automated LLM-driven pipeline for visual code synthesis using complete UI sandbox environments. Our extensive analysis reveals that V-GameGym effectively bridges the gap between code generation accuracy and practical game development workflows, providing quantifiable quality metrics for visual programming and interactive element generation.
Beyond Quantity: Trajectory Diversity Scaling for Code Agents
Guhong Chen | Chenghao Sun | Cheng Fu | Qiyao Wang | Zhihong Huang | ChaoPeng Wei | Guangxu Chen | Feiteng Fang | Ahmadreza Argha | Bing Zhao | Xander Xu | Qi Han | Hamid Alinejad-Rokny | Qiang Qu | Binhua Li | Shiwen Ni | Min Yang | HU Wei | Yongbin Li
Findings of the Association for Computational Linguistics: ACL 2026
As code large language models (LLMs) evolve into tool-interactive agents via the Model Context Protocol (MCP), their generalization is increasingly limited by low-quality synthetic data and the diminishing returns of quantity scaling; moreover, quantity-centric scaling exhibits an early bottleneck that underutilizes trajectory data. We propose TDScaling, a Trajectory Diversity Scaling-based data synthesis framework for code agents that scales performance through diversity rather than raw volume. TDScaling is also more data-efficient: under a fixed training budget, increasing trajectory diversity yields larger gains than adding more trajectories, improving the performance-cost trade-off for agent training. TDScaling integrates four innovations: (1) a Business Cluster mechanism that captures real-service logical dependencies; (2) a Blueprint-driven multi-agent paradigm that enforces trajectory coherence; (3) an adaptive evolution mechanism that steers synthesis toward long-tail scenarios using Domain Entropy, Reasoning Mode Entropy, and Cumulative Action Complexity to prevent mode collapse; and (4) a sandboxed code tool that mitigates catastrophic forgetting of intrinsic coding capabilities. Experiments on general tool-use benchmarks (BFCL, τ²-Bench) and code agent tasks (RebenchT, CodeCI, BIRD) demonstrate a win-win outcome: TDScaling improves both tool-use generalization and inherent coding proficiency. Crucially, we show that trajectory diversity scaling attains a substantially higher performance ceiling than quantity scaling, establishing a resource-efficient paradigm for training robust code agents under data bottlenecks.
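To make the idea of an entropy-based diversity signal concrete, the sketch below computes Shannon entropy over the domain labels of a set of trajectories. This is an assumption about what a quantity like Domain Entropy might look like; the abstract does not define it, and the function name and example labels are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Iterable


def shannon_entropy(labels: Iterable[Hashable]) -> float:
    """Shannon entropy (in bits) of the empirical label distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # an empty collection carries no diversity
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical domain labels attached to synthesized trajectories.
collapsed = ["payments", "payments", "payments", "payments"]
diverse = ["payments", "search", "calendar", "email"]

collapsed_entropy = shannon_entropy(collapsed)  # one domain only
diverse_entropy = shannon_entropy(diverse)      # four equally frequent domains
```

Under this reading, a synthesis loop could favor under-represented domains whenever the entropy of the current pool drops, which is one plausible way to counter mode collapse.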
Co-authors
- Xander Xu 2
- Bing Zhao 2
- Hamid Alinejad-Rokny 1
- Ahmadreza Argha 1
- Linzheng Chai 1
- Qingjing Chen 1
- Yifan Chen 1
- Xiaoming Chen 1
- Guhong Chen 1
- Guangxu Chen 1
- Charles L. A. Clarke 1
- Ganqu Cui 1
- Feng Di 1
- Ning Ding 1
- Feiteng Fang 1
- Cheng Fu 1
- Song Gaojie 1
- Shuyue Guo 1
- Yiran HU 1
- Qi Han 1
- Zhihong Huang 1
- Zhang Jingwen 1
- Binhua Li 1
- Yongbin Li 1
- Junyang Lin 1
- Huanghai Liu 1
- Dayiheng Liu 1
- Wenbo Lv 1
- Yubo Ma 1
- Shiwen Ni 1
- Yuemeng Qi 1
- Qiang Qu 1
- Weixing Shen 1
- Yuzhen Shi 1
- Rongyao Shi 1
- Chenghao Sun 1
- Tianyi Tang 1
- Renshuai Tao 1
- Wei Wang 1
- Qiyao Wang 1
- ChaoPeng Wei 1
- Weiheng Wu 1
- Jiajun Wu 1
- Sui Xiaoyu 1
- Xu Xinran 1
- Kexin Yang 1
- Sen Yang 1
- An Yang 1
- Jian Yang 1
- Min Yang 1
- Zhang Yi 1
- Bowen Yu 1
- Qiu Yuanyang 1
- Li Zhang 1
- Wei Zhang 1
- Bowen Zhou 1