Shengbin Yue
2026
Ready Jurist One: Benchmarking Language Agents for Legal Intelligence in Dynamic Environments
Zheng Jia | Shengbin Yue | Wei Chen | Siyuan Wang | Yidong Liu | Zejun Li | Yun Song | Zhongyu Wei
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The gap between existing benchmarks and the dynamic nature of real-world legal practice poses a key barrier to advancing legal intelligence. To this end, we introduce J1-ENVS, the first interactive and dynamic legal environment tailored for LLM-based agents. Guided by legal experts, it comprises six representative scenarios from Chinese legal practice at three levels of environmental complexity. We further introduce J1-EVAL, a dual-metric evaluation framework designed to assess both task performance and procedural compliance across varying levels of legal proficiency. Extensive experiments on 17 LLM agents reveal that while many models demonstrate solid legal knowledge, they struggle with procedural execution in dynamic settings. Even the SOTA model scores below 60% in overall performance. These findings highlight persistent challenges in achieving dynamic legal intelligence and offer valuable insights to guide future research.
2025
HAF-RM: A Hybrid Alignment Framework for Reward Model Training
Shujun Liu | Xiaoyu Shen | Yuhang Lai | Siyuan Wang | Shengbin Yue | Zengfeng Huang | Xuanjing Huang | Zhongyu Wei
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The reward model has become increasingly important in alignment, assessment, and data construction for large language models (LLMs). Most existing research focuses on enhancing reward models through data improvements, following the conventional training framework for reward models that directly optimizes the predicted rewards. In this paper, we propose a hybrid alignment framework **HAF-RM** for reward model training by introducing an additional constraint on token-level policy probabilities in addition to the reward score. It can simultaneously supervise the internal preference model at the token level and optimize the mapping layer of the reward model at the sequence level. Experimental results on five datasets show the validity and effectiveness of our proposed hybrid framework for training a high-quality reward model. By decoupling the reward modeling procedure and incorporating hybrid supervision, our **HAF-RM** framework offers a principled and effective approach to enhancing the performance and alignment of reward models, a critical component in the responsible development of powerful language models. We release our code at [https://haf-rm.github.io](https://haf-rm.github.io).
Multi-Agent Simulator Drives Language Models for Legal Intensive Interaction
Shengbin Yue | Ting Huang | Zheng Jia | Siyuan Wang | Shujun Liu | Yun Song | Xuanjing Huang | Zhongyu Wei
Findings of the Association for Computational Linguistics: NAACL 2025
Large Language Models (LLMs) have significantly advanced legal intelligence, but the scarcity of scenario data impedes progress toward interactive legal scenarios. This paper introduces a Multi-agent Legal Simulation Driver (MASER) to scalably generate synthetic data by simulating interactive legal scenarios. Leveraging real legal case sources, MASER ensures the consistency of legal attributes between participants and introduces a supervisory mechanism to align participants’ characters and behaviors and to address distractions. A Multi-stage Interactive Legal Evaluation (MILE) benchmark is further constructed to evaluate LLMs’ performance in dynamic legal scenarios. Extensive experiments confirm the effectiveness of our framework.