Qingjing Chen
2026
PLawBench: A Rubric-Based Benchmark for Evaluating LLMs in Real-World Legal Practice
Yuzhen Shi | Huanghai Liu | Yiran HU | Song Gaojie | Xu Xinran | Yubo Ma | Tianyi Tang | Li Zhang | Qingjing Chen | Feng Di | Wenbo Lv | Weiheng Wu | Kexin Yang | Sen Yang | Wei Wang | Rongyao Shi | Qiu Yuanyang | Yuemeng Qi | Zhang Jingwen | Sui Xiaoyu | Yifan Chen | Zhang Yi | An Yang | Bowen Yu | Dayiheng Liu | Junyang Lin | Weixing Shen | Bing Zhao | Charles L. A. Clarke | HU Wei
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
As large language models (LLMs) are increasingly applied to legal domain-specific tasks, evaluating their ability to perform legal work in real-world settings has become essential. However, existing legal benchmarks rely on simplified and highly standardized tasks, failing to capture the ambiguity, complexity, and reasoning demands of real legal practice. Moreover, prior evaluations often adopt coarse, single-dimensional metrics and do not explicitly assess fine-grained legal reasoning. To address these limitations, we introduce PLawBench, a Practical Law Benchmark designed to evaluate LLMs in realistic legal practice scenarios. Grounded in real-world legal workflows, PLawBench models the core processes of legal practitioners through three task categories: public legal consultation, practical case analysis, and legal document generation. These tasks assess a model’s ability to identify legal issues and key facts, perform structured legal reasoning, and generate legally coherent documents. PLawBench comprises 850 questions across 13 practical legal scenarios, with each question accompanied by expert-designed evaluation rubrics, resulting in approximately 12,500 rubric items for fine-grained assessment. Using an LLM-based evaluator aligned with human expert judgments, we evaluate 10 state-of-the-art LLMs. Experimental results show that none achieves strong performance on PLawBench, revealing substantial limitations in the fine-grained legal reasoning capabilities of current LLMs and highlighting important directions for future evaluation and development of legal LLMs. Data is available at: https://anonymous.4open.science/r/PLawbench-B524/.
2025
JUREX-4E: Juridical Expert-Annotated Four-Element Knowledge Base for Legal Reasoning
Huanghai Liu | Quzhe Huang | Qingjing Chen | Yiran Hu | Jiayu Ma | Yun Liu | Weixing Shen | Yansong Feng
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
In recent years, Large Language Models (LLMs) have been widely applied to legal tasks. To enhance their understanding of legal texts and improve reasoning accuracy, a promising approach is to incorporate legal theories. One of the most widely adopted theories is the Four-Element Theory (FET), which defines the crime constitution through four elements: Subject, Object, Subjective Aspect, and Objective Aspect. While recent work has explored prompting LLMs to follow FET, our evaluation demonstrates that LLM-generated four-elements are often incomplete and less representative, limiting their effectiveness in legal reasoning. To address these issues, we present JUREX-4E, an expert-annotated four-element knowledge base covering 155 criminal charges. The annotations follow a progressive hierarchical framework grounded in legal source validity and incorporate diverse interpretive methods to ensure precision and authority. We evaluate JUREX-4E on the Similar Charge Disambiguation task and apply it to Legal Case Retrieval. Experimental results validate the high quality of JUREX-4E and its substantial impact on downstream legal tasks, underscoring its potential for advancing legal AI applications. The dataset and code are available at: https://github.com/THUlawtech/JUREX
Co-authors
- Yiran HU 2
- Huanghai Liu 2
- Weixing Shen 2
- Yifan Chen 1
- Charles L. A. Clarke 1
- Feng Di 1
- Yansong Feng 1
- Song Gaojie 1
- Quzhe Huang 1
- Zhang Jingwen 1
- Junyang Lin 1
- Dayiheng Liu 1
- Yun Liu 1
- Wenbo Lv 1
- Yubo Ma 1
- Jiayu Ma 1
- Yuemeng Qi 1
- Yuzhen Shi 1
- Rongyao Shi 1
- Tianyi Tang 1
- Wei Wang 1
- HU Wei 1
- Weiheng Wu 1
- Sui Xiaoyu 1
- Xu Xinran 1
- Kexin Yang 1
- Sen Yang 1
- An Yang 1
- Zhang Yi 1
- Bowen Yu 1
- Qiu Yuanyang 1
- Li Zhang 1
- Bing Zhao 1