JurisBench: A Deep Benchmark for Assessing Large Language Models in Professional Legal Practice

Ziang Chen; Guannan Li; Fanlin Ji; Yipeng Kang; Jiaqi Li; Muhan Zhang; Yangtao Zhang; Li Tianjiao; Jiannan Wang; Xin Guo; Song-Chun Zhu; Bin Ling

Abstract

Large Language Models (LLMs) have demonstrated strong cross-domain capabilities, yet their competence in specialized professional tasks remains underexamined. Existing legal benchmarks evaluate isolated tasks or exam-style questions, failing to capture the procedural interdependencies and adjudicative rigor inherent in professional practice. To bridge this gap, we construct JurisBench, a vertical, depth-oriented, domain-specific benchmark designed to evaluate LLMs across key stages of Chinese civil litigation. JurisBench introduces a Linear Depth Simulation track that mirrors the cognitive workflow of professional judges through four sequential, dependency-aware phases: Cause of Action prediction, Focus of Disputes identification, Rationale of the Judgment generation, and Result of the Judgment determination. Results reveal an “illusion of competence”: state-of-the-art models exhibit marked performance degradation in end-to-end pipelines due to cascading error propagation. We identify precise statutory grounding as a persistent bottleneck, highlighting a critical gap between fluent linguistic output and judicial reliability. JurisBench shifts evaluation from isolated legal knowledge to workflow-level task execution, providing a diagnostic framework for legal AI and a template for benchmark design in specialized domains.

Anthology ID: 2026.acl-long.1666
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 35994–36018
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1666/
Cite (ACL): Ziang Chen, Guannan Li, Fanlin Ji, Yipeng Kang, Jiaqi Li, Muhan Zhang, Yangtao Zhang, Li Tianjiao, Jiannan Wang, Xin Guo, Song-Chun Zhu, and Bin Ling. 2026. JurisBench: A Deep Benchmark for Assessing Large Language Models in Professional Legal Practice. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 35994–36018, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): JurisBench: A Deep Benchmark for Assessing Large Language Models in Professional Legal Practice (Chen et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1666.pdf
Checklist: 2026.acl-long.1666.checklist.pdf
