Weixing Shen
2026
LexRel: Benchmarking Legal Relation Extraction for Chinese Civil Cases
Yida Cai | Ranjuexiao Hu | Huiyuan Xie | Chenyang Li | Yun Liu | Yuxiao Ye | Zhenghao Liu | Weixing Shen | Zhiyuan Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Legal relations serve as an important analytical framework for dispute resolution in civil cases. However, legal relations in Chinese civil cases remain underexplored in the field of legal AI, largely due to the absence of comprehensive schemas. In this work, we first introduce a comprehensive schema for legal relations in civil cases, which contains a hierarchical taxonomy and definitions of arguments. Based on this schema, we formulate a legal relation extraction task and present **LexRel**, an expert-annotated benchmark for legal relation extraction in the Chinese civil law domain. We use **LexRel** to evaluate state-of-the-art large language models (LLMs) on legal relation extraction, showing that current LLMs exhibit significant limitations in accurately identifying civil legal relations. Furthermore, we demonstrate that explicitly incorporating information about legal relations leads to promising performance gains on other downstream legal AI tasks.
PLAWBENCH: A Rubric-Based Benchmark for Evaluating LLMs in Real-World Legal Practice
Yuzhen Shi | Huanghai Liu | Yiran HU | Song Gaojie | Xu Xinran | Yubo Ma | Tianyi Tang | Li Zhang | Qingjing Chen | Feng Di | Wenbo Lv | Weiheng Wu | Kexin Yang | Sen Yang | Wei Wang | Rongyao Shi | Qiu Yuanyang | Yuemeng Qi | Zhang Jingwen | Sui Xiaoyu | Yifan Chen | Zhang Yi | An Yang | Bowen Yu | Dayiheng Liu | Junyang Lin | Weixing Shen | Bing Zhao | Charles L. A. Clarke | HU Wei
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
As large language models (LLMs) are increasingly applied to legal domain-specific tasks, evaluating their ability to perform legal work in real-world settings has become essential. However, existing legal benchmarks rely on simplified and highly standardized tasks, failing to capture the ambiguity, complexity, and reasoning demands of real legal practice. Moreover, prior evaluations often adopt coarse, single-dimensional metrics and do not explicitly assess fine-grained legal reasoning. To address these limitations, we introduce PLawBench, a Practical Law Benchmark designed to evaluate LLMs in realistic legal practice scenarios. Grounded in real-world legal workflows, PLawBench models the core processes of legal practitioners through three task categories: public legal consultation, practical case analysis, and legal document generation. These tasks assess a model’s ability to identify legal issues and key facts, perform structured legal reasoning, and generate legally coherent documents. PLawBench comprises 850 questions across 13 practical legal scenarios, with each question accompanied by expert-designed evaluation rubrics, resulting in approximately 12,500 rubric items for fine-grained assessment. Using an LLM-based evaluator aligned with human expert judgments, we evaluate 10 state-of-the-art LLMs. Experimental results show that none achieves strong performance on PLawBench, revealing substantial limitations in the fine-grained legal reasoning capabilities of current LLMs and highlighting important directions for future evaluation and development of legal LLMs. Data is available at: https://anonymous.4open.science/r/PLawbench-B524/.
2025
JUREX-4E: Juridical Expert-Annotated Four-Element Knowledge Base for Legal Reasoning
Huanghai Liu | Quzhe Huang | Qingjing Chen | Yiran Hu | Jiayu Ma | Yun Liu | Weixing Shen | Yansong Feng
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
In recent years, Large Language Models (LLMs) have been widely applied to legal tasks. To enhance their understanding of legal texts and improve reasoning accuracy, a promising approach is to incorporate legal theories. One of the most widely adopted theories is the Four-Element Theory (FET), which defines the crime constitution through four elements: Subject, Object, Subjective Aspect, and Objective Aspect. While recent work has explored prompting LLMs to follow FET, our evaluation demonstrates that LLM-generated four-elements are often incomplete and less representative, limiting their effectiveness in legal reasoning. To address these issues, we present JUREX-4E, an expert-annotated four-element knowledge base covering 155 criminal charges. The annotations follow a progressive hierarchical framework grounded in legal source validity and incorporate diverse interpretive methods to ensure precision and authority. We evaluate JUREX-4E on the Similar Charge Disambiguation task and apply it to Legal Case Retrieval. Experimental results validate the high quality of JUREX-4E and its substantial impact on downstream legal tasks, underscoring its potential for advancing legal AI applications. The dataset and code are available at: https://github.com/THUlawtech/JUREX
2024
STARD: A Chinese Statute Retrieval Dataset Derived from Real-life Queries by Non-professionals
Weihang Su | Yiran Hu | Anzhe Xie | Qingyao Ai | Quezi Bing | Ning Zheng | Yun Liu | Weixing Shen | Yiqun Liu
Findings of the Association for Computational Linguistics: EMNLP 2024
Statute retrieval aims to find relevant statutory articles for specific queries. This process is the basis of a wide range of legal applications such as legal advice, automated judicial decisions, legal document drafting, etc. Existing statute retrieval benchmarks emphasize formal and professional queries from sources like bar exams and legal case documents, thereby neglecting non-professional queries from the general public, which often lack precise legal terminology and references. To address this gap, we introduce the STAtute Retrieval Dataset (STARD), a Chinese dataset comprising 1,543 query cases collected from real-world legal consultations and 55,348 candidate statutory articles. Unlike existing statute retrieval datasets, which primarily focus on professional legal queries, STARD captures the complexity and diversity of real queries from the general public. Through a comprehensive evaluation of various retrieval baselines, we reveal that existing retrieval approaches all fall short of these real queries issued by non-professional users. The best method only achieves a Recall@100 of 0.907, suggesting the necessity for further exploration and additional research in this area.
2023
The Devil is in the Details: On the Pitfalls of Event Extraction Evaluation
Hao Peng | Xiaozhi Wang | Feng Yao | Kaisheng Zeng | Lei Hou | Juanzi Li | Zhiyuan Liu | Weixing Shen
Findings of the Association for Computational Linguistics: ACL 2023
Event extraction (EE) is a crucial task aiming at extracting events from texts, which includes two subtasks: event detection (ED) and event argument extraction (EAE). In this paper, we check the reliability of EE evaluations and identify three major pitfalls: (1) The data preprocessing discrepancy makes the evaluation results on the same dataset not directly comparable, but the data preprocessing details are not widely noted and specified in papers. (2) The output space discrepancy of different model paradigms leaves different-paradigm EE models without grounds for comparison and also leads to unclear mapping issues between predictions and annotations. (3) The absence of pipeline evaluation in many EAE-only works makes them hard to compare directly with EE works, and their results may not reflect model performance in real-world pipeline scenarios. We demonstrate the significant influence of these pitfalls through comprehensive meta-analyses of recent papers and empirical experiments. To avoid these pitfalls, we suggest a series of remedies, including specifying data preprocessing, standardizing outputs, and providing pipeline evaluation results. To help implement these remedies, we develop a consistent evaluation framework OmniEvent, which can be obtained from https://github.com/THU-KEG/OmniEvent.
2022
LEVEN: A Large-Scale Chinese Legal Event Detection Dataset
Feng Yao | Chaojun Xiao | Xiaozhi Wang | Zhiyuan Liu | Lei Hou | Cunchao Tu | Juanzi Li | Yun Liu | Weixing Shen | Maosong Sun
Findings of the Association for Computational Linguistics: ACL 2022
Recognizing facts is the most fundamental step in making judgments, hence detecting events in legal documents is important to legal case analysis tasks. However, existing Legal Event Detection (LED) datasets cover only a limited set of event types and have limited annotated data, which restricts the development of LED methods and their downstream applications. To alleviate these issues, we present LEVEN, a large-scale Chinese LEgal eVENt detection dataset, with 8,116 legal documents and 150,977 human-annotated event mentions in 108 event types. Beyond charge-related events, LEVEN also covers general events, which are critical for legal case understanding but neglected in existing LED datasets. To our knowledge, LEVEN is the largest LED dataset and has dozens of times the data scale of others, which should significantly promote the training and evaluation of LED methods. The results of extensive experiments indicate that LED is challenging and needs further effort. Moreover, we simply utilize legal events as side information to promote downstream applications. The method achieves average improvements of 2.2 points in precision in low-resource judgment prediction and 1.5 points in mean average precision in unsupervised case retrieval, which suggests the fundamental role of LED. The source code and dataset can be obtained from https://github.com/thunlp/LEVEN.
Co-authors
- Yun Liu 4
- Yiran HU 3
- Qingjing Chen 2
- Lei Hou 2
- Juanzi Li 2
- Huanghai Liu 2
- Zhiyuan Liu 2
- Xiaozhi Wang 2
- Feng Yao 2
- Qingyao Ai 1
- Quezi Bing 1
- Yida Cai 1
- Yifan Chen 1
- Charles L. A. Clarke 1
- Feng Di 1
- Yansong Feng 1
- Song Gaojie 1
- Ranjuexiao Hu 1
- Quzhe Huang 1
- Zhang Jingwen 1
- Chenyang Li (李晨阳) 1
- Junyang Lin 1
- Dayiheng Liu 1
- Yiqun Liu 1
- Zhenghao Liu (刘正皓) 1
- Zhiyuan Liu 1
- Wenbo Lv 1
- Jiayu Ma 1
- Yubo Ma 1
- Hao Peng 1
- Yuemeng Qi 1
- Rongyao Shi 1
- Yuzhen Shi 1
- Weihang Su 1
- Maosong Sun (孙茂松) 1
- Tianyi Tang 1
- Cunchao Tu 1
- Wei Wang 1
- HU Wei 1
- Weiheng Wu 1
- Chaojun Xiao 1
- Sui Xiaoyu 1
- Anzhe Xie 1
- Huiyuan Xie 1
- Xu Xinran 1
- An Yang 1
- Kexin Yang 1
- Sen Yang 1
- Yuxiao Ye 1
- Zhang Yi 1
- Bowen Yu 1
- Qiu Yuanyang 1
- Kaisheng Zeng 1
- Li Zhang 1
- Bing Zhao 1
- Ning Zheng 1