Jiannan Wang

2026

Large Language Models (LLMs) have demonstrated strong cross-domain capabilities, yet their competence in specialized professional tasks remains underexamined. Existing legal benchmarks evaluate isolated tasks or exam-style questions, failing to capture the procedural interdependencies and adjudicative rigor inherent in professional practice. To bridge this gap, we construct JurisBench, a vertical, depth-oriented, domain-specific benchmark designed to evaluate LLMs across key stages of Chinese civil litigation. JurisBench introduces a Linear Depth Simulation track that mirrors the cognitive workflow of professional judges through four sequential, dependency-aware phases: Cause of Action prediction, Focus of Disputes identification, Rationale of the Judgment generation, and Result of the Judgment determination. Results reveal an “illusion of competence”: state-of-the-art models exhibit marked performance degradation in end-to-end pipelines due to cascading error propagation. We identify precise statutory grounding as a persistent bottleneck, highlighting a critical gap between fluent linguistic output and judicial reliability. JurisBench shifts evaluation from isolated legal knowledge to workflow-level task execution, providing a diagnostic framework for legal AI and a template for benchmark design in specialized domains.

Co-authors

Jiaqi Li

Venues

ACL
