2025
CitaLaw: Enhancing LLM with Citations in Legal Domain
Kepu Zhang | Weijie Yu | Sunhao Dai | Jun Xu
Findings of the Association for Computational Linguistics: ACL 2025
In this paper, we propose CitaLaw, the first benchmark designed to evaluate LLMs’ ability to produce legally sound responses with appropriate citations. CitaLaw features a diverse set of legal questions for both laypersons and practitioners, paired with a comprehensive corpus of law articles and precedent cases as a reference pool. This framework enables LLM-based systems to retrieve supporting citations from the reference corpus and align these citations with the corresponding sentences in their responses. Moreover, we introduce syllogism-inspired evaluation methods to assess the legal alignment between retrieved references and LLM-generated responses, as well as their consistency with user questions. Extensive experiments on 2 open-domain and 7 legal-specific LLMs demonstrate that integrating legal references substantially enhances response quality. Furthermore, our proposed syllogism-based evaluation method exhibits strong agreement with human judgments.
Legal Mathematical Reasoning with LLMs: Procedural Alignment through Two-Stage Reinforcement Learning
Kepu Zhang | Guofu Xie | Weijie Yu | Mingyue Xu | Xu Tang | Yaxin Li | Jun Xu
Findings of the Association for Computational Linguistics: EMNLP 2025
Legal mathematical reasoning is essential for applying large language models (LLMs) in high-stakes legal contexts, where outputs must be both mathematically accurate and procedurally compliant. However, existing legal LLMs lack structured numerical reasoning, and open-domain models, though capable of calculations, often overlook mandatory legal steps. To address this, we present LexNum, the first Chinese legal mathematical reasoning benchmark, covering three representative scenarios where each instance reflects legally grounded procedural flows. We further propose LexPam, a two-stage reinforcement learning framework for efficient legal reasoning training. Leveraging curriculum learning, we use a stronger teacher model to partition data into basic and challenging subsets. A lightweight 1.5B student model is then fine-tuned with Group Relative Policy Optimization, which avoids costly value networks and enables stable training from sparse, end-of-sequence rewards. The first stage improves accuracy and format; the second introduces a novel reward to guide procedural alignment via task-specific legal elements. Experiments show that existing models perform poorly on LexNum, while LexPam enhances both mathematical accuracy and legal coherence, and generalizes effectively across tasks and domains.
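As an illustration of why Group Relative Policy Optimization needs no value network, the following minimal sketch computes group-relative advantages: each sampled response's end-of-sequence reward is normalized against the mean and standard deviation of the other responses sampled for the same prompt. The function name, the `eps` constant, the use of the population standard deviation, and the example rewards are assumptions made for illustration; this is not the LexPam implementation, and its reward design is not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    rewards: scalar end-of-sequence rewards, one per response sampled
             for the same prompt (hypothetical values in this sketch).
    eps:     small constant (an assumption) that avoids division by zero
             when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses to one prompt:
# 1.0 = correct final answer, 0.0 = incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive a positive advantage and those below receive a negative one, so the group itself serves as the baseline that a learned value function would otherwise provide.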
Beyond Guilt: Legal Judgment Prediction with Trichotomous Reasoning
Kepu Zhang | Haoyue Yang | Xu Tang | Weijie Yu | Jun Xu
Findings of the Association for Computational Linguistics: EMNLP 2025
In legal practice, judges apply the trichotomous dogmatics of criminal law, sequentially assessing the elements of the offense, unlawfulness, and culpability to determine whether an individual’s conduct constitutes a crime. Although current legal large language models (LLMs) show promising accuracy in judgment prediction, they lack trichotomous reasoning capabilities due to the absence of an appropriate benchmark dataset, preventing them from predicting innocent outcomes. As a result, every input is automatically assigned a charge, limiting their practical utility in legal contexts. To bridge this gap, we introduce LJPIV, the first benchmark dataset for Legal Judgment Prediction with Innocent Verdicts. Adhering to the trichotomous dogmatics, we extend three widely used legal datasets through LLM-based augmentation and manual verification. Our experiments with state-of-the-art legal LLMs and novel strategies that integrate trichotomous reasoning into zero-shot prompting and fine-tuning reveal: (1) current legal LLMs have significant room for improvement, with even the best models achieving an F1 score of less than 0.3 on LJPIV; and (2) our strategies notably enhance both in-domain and cross-domain judgment prediction accuracy, especially for cases resulting in an innocent verdict.
2024
Effective In-Context Example Selection through Data Compression
ZhongXiang Sun | Kepu Zhang | Haoyu Wang | Xiao Zhang | Jun Xu
Findings of the Association for Computational Linguistics: ACL 2024
In-context learning has been extensively validated in large language models. However, the mechanism and strategy for selecting in-context examples, a crucial ingredient in this approach, lack systematic and in-depth research. In this paper, we propose a data compression approach to the selection of in-context examples. We introduce a two-stage method that can effectively choose relevant examples and retain sufficient information about the training dataset within the in-context examples. Our method shows a significant average improvement of 5.90% across five different real-world datasets using four language models.
Logic Rules as Explanations for Legal Case Retrieval
ZhongXiang Sun | Kepu Zhang | Weijie Yu | Haoyu Wang | Jun Xu
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
In this paper, we address the issue of using logic rules to explain the results from legal case retrieval. The task is critical to legal case retrieval because the users (e.g., lawyers or judges) are highly specialized and require the system to provide logical, faithful, and interpretable explanations before making legal decisions. Recently, research efforts have been made to learn explainable legal case retrieval models. However, these methods usually select rationales (key sentences) from the legal cases as explanations, failing to provide faithful and logically correct explanations. We propose Neural-Symbolic enhanced Legal Case Retrieval (NS-LCR), a framework that explicitly conducts reasoning on the matching of legal cases through learning case-level and law-level logic rules. The learned rules are then integrated into the retrieval process in a neuro-symbolic manner. Benefiting from the logical and interpretable nature of the logic rules, NS-LCR is equipped with built-in faithful explainability. We also show that NS-LCR is a model-agnostic framework that can be plugged into multiple legal retrieval models. To demonstrate the superiority of NS-LCR, we extend the benchmarks of LeCaRD and ELAM with manually annotated logic rules and propose a new explainability measure based on Large Language Models (LLMs). Extensive experiments show that NS-LCR can achieve state-of-the-art ranking performance, and the empirical analysis also shows that NS-LCR is capable of providing faithful explanations for legal case retrieval.