Proceedings of the Natural Legal Language Processing Workshop 2025

Nikolaos Aletras, Ilias Chalkidis, Leslie Barrett, Cătălina Goanță, Daniel Preoțiuc-Pietro, Gerasimos Spanakis (Editors)


Anthology ID:
2025.nllp-1
Month:
November
Year:
2025
Address:
Suzhou, China
Venues:
NLLP | WS
Publisher:
Association for Computational Linguistics
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.nllp-1/
ISBN:
979-8-89176-338-8
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.nllp-1.pdf

Proceedings of the Natural Legal Language Processing Workshop 2025
Nikolaos Aletras | Ilias Chalkidis | Leslie Barrett | Cătălina Goanță | Daniel Preoțiuc-Pietro | Gerasimos Spanakis

Tracing Definitions: Lessons from Alliance Contracts in the Biopharmaceutical Industry
Maximilian Kreutner | Doerte Leusmann | Florian Lemmerich | Carolin Haeussler

Definitions in alliance contracts play a critical role in shaping agreements, yet they can also lead to costly misunderstandings. This is exemplified by the multimillion-dollar AstraZeneca-European Commission (EC) dispute, where the interpretation of ‘best reasonable effort’ became the focal point of contention. In this interdisciplinary study, we leverage natural language processing (NLP) to systematically analyze patterns in the definitions included in alliance contracts. More specifically, we categorize the content of definitions into topics, identify common terms versus outliers that are semantically dissimilar and infrequently used, and track how definitions evolve over time. Analyzing a dataset of 380,131 definitions from 12,468 alliance contracts in the biopharmaceutical industry, we find that definitions span legal, technological, and social topics, with social terms showing the highest dissimilarity across contracts. Using dynamic topic modeling, we explore how the content of definitions has shifted over two decades (2000–2020) and identify prevalent trends suggesting that contractual definitions reflect broader economic contexts. Notably, our results reveal that the AstraZeneca-EC dispute arose from an outlier, a highly unusual definition that could have been flagged using NLP. Overall, these findings highlight the potential of data-driven approaches to uncover patterns in alliance contracts.
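
The abstract does not include code; as a rough illustration of the outlier-flagging idea, the sketch below embeds competing definitions of the same term and ranks them by distance from the term's centroid. The embedding model, the sample definitions, and the ranking rule are illustrative assumptions, not the authors' pipeline.

```python
# Minimal sketch: rank definitions of the same defined term by how far their
# embeddings sit from the term's centroid. Model choice and example data are
# illustrative assumptions, not the paper's setup.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def rank_by_dissimilarity(texts):
    """Return (definition, distance) pairs, most unusual first."""
    emb = model.encode(texts, normalize_embeddings=True)
    centroid = emb.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    dist = 1.0 - emb @ centroid            # cosine distance to the centroid
    order = np.argsort(-dist)
    return [(texts[i], float(dist[i])) for i in order]

definitions = [
    "Efforts consistent with those a similar company would use for a similar product.",
    "Commercially reasonable efforts in light of market conditions and available resources.",
    "Efforts to supply doses from designated EU facilities only, subject to prior commitments.",
]
for text, d in rank_by_dissimilarity(definitions):
    print(f"{d:.3f}  {text}")
```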

The Automated but Risky Game: Modeling Agent-to-Agent Negotiations and Transactions in Consumer Markets
Shenzhe Zhu | Jiao Sun | Yi Nian | Tobin South | Alex Pentland | Jiaxin Pei

AI agents are increasingly used in consumer-facing applications to assist with tasks such as product search, negotiation, and transaction execution. In this paper, we investigate a future setting where both consumers and merchants authorize AI agents to automate negotiations and transactions on their behalf. We aim to address two questions: (1) Do different LLM agents vary in their ability to make deals on behalf of their users? (2) What are the potential risks when we use AI agents to fully automate negotiations and deal-making in consumer settings? We designed an experimental framework to evaluate AI agents’ capabilities and performance in real-world negotiation and transaction scenarios, and experimented with a range of open-source and closed-source LLMs. Our analysis reveals that deal-making with LLM agents in consumer settings is an inherently imbalanced game: different AI agents show large disparities in obtaining the best deals for their users. Furthermore, we found that LLMs’ behavioral anomalies can lead to financial losses when deployed in real-world decision-making scenarios, such as overspending or making unreasonable deals. Our findings highlight that while automation can enhance transactional efficiency, it also poses nontrivial risks to consumer markets. Users should be careful when delegating business decisions to LLM agents.
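
For readers unfamiliar with agent-to-agent deal-making setups, the following is a heavily simplified sketch of a buyer/seller negotiation loop with a budget guardrail check; `call_llm`, the prompts, and the stopping rule are hypothetical placeholders and not the paper's framework.

```python
# Simplified sketch of an agent-to-agent price negotiation loop with a check
# for the "overspending" failure mode mentioned in the abstract.
# `call_llm` is a hypothetical placeholder for any chat-completion client.
import re

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def negotiate(listing_price: float, buyer_budget: float, max_rounds: int = 6):
    history = []
    for _ in range(max_rounds):
        buyer_msg = call_llm(
            f"You are a buyer agent with a hard budget of {buyer_budget}. "
            f"Listing price: {listing_price}. Conversation so far: {history}. "
            "Reply with 'OFFER: <number>' or 'ACCEPT'."
        )
        history.append(("buyer", buyer_msg))
        if "ACCEPT" in buyer_msg.upper():
            break
        seller_msg = call_llm(
            f"You are a seller agent. Listing price: {listing_price}. "
            f"Conversation so far: {history}. Reply with 'COUNTER: <number>' or 'ACCEPT'."
        )
        history.append(("seller", seller_msg))
        if "ACCEPT" in seller_msg.upper():
            break
    # Guardrail check: did the buyer agent ever offer more than its budget?
    buyer_offers = [float(x) for role, msg in history if role == "buyer"
                    for x in re.findall(r"OFFER:\s*([\d.]+)", msg)]
    overspent = any(o > buyer_budget for o in buyer_offers)
    return history, overspent
```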

Towards Reliable Retrieval in RAG Systems for Large Legal Datasets
Markus Reuter | Tobias Lingenberg | Ruta Liepina | Francesca Lagioia | Marco Lippi | Giovanni Sartor | Andrea Passerini | Burcu Sayin

Retrieval-Augmented Generation (RAG) is a promising approach to mitigate hallucinations in Large Language Models (LLMs) for legal applications, but its reliability is critically dependent on the accuracy of the retrieval step. This is particularly challenging in the legal domain, where large databases of structurally similar documents often cause retrieval systems to fail. In this paper, we address this challenge by first identifying and quantifying a critical failure mode we term Document-Level Retrieval Mismatch (DRM), where the retriever selects information from entirely incorrect source documents. To mitigate DRM, we investigate a simple and computationally efficient technique which we refer to as Summary-Augmented Chunking (SAC). This method enhances each text chunk with a document-level synthetic summary, thereby injecting crucial global context that would otherwise be lost during a standard chunking process. Our experiments on a diverse set of legal information retrieval tasks show that SAC greatly reduces DRM and, consequently, also improves text-level retrieval precision and recall. Interestingly, we find that a generic summarization strategy outperforms an approach that incorporates legal expert domain knowledge to target specific legal elements. Our work provides evidence that this practical, scalable, and easily integrable technique enhances the reliability of RAG systems when applied to large-scale legal document datasets.
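
The abstract describes Summary-Augmented Chunking as prefixing every chunk with a document-level synthetic summary before indexing. The sketch below shows one way such an indexing step could look; the chunking parameters, embedding model, and `summarize` placeholder are assumptions, not the authors' implementation.

```python
# Sketch of Summary-Augmented Chunking (SAC): embed each chunk together with a
# document-level synthetic summary so every chunk carries global context.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def summarize(document: str) -> str:
    # Placeholder for an LLM-generated document-level summary.
    raise NotImplementedError

def chunk(text: str, size: int = 800, overlap: int = 100):
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def index_with_sac(doc_id: str, document: str):
    summary = summarize(document)
    entries = []
    for i, c in enumerate(chunk(document)):
        augmented = f"[Document summary] {summary}\n[Passage] {c}"
        entries.append({
            "doc_id": doc_id,
            "chunk_id": i,
            "text": c,                              # original chunk, returned at answer time
            "embedding": encoder.encode(augmented),  # embed the chunk *with* its summary
        })
    return entries
```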

Translating Tax Law to Code with LLMs: A Benchmark and Evaluation Framework
Gabriele Lorenzo | Aldo Pietromatera | Nils Holzenberger

Catala is a domain-specific programming language for tax law, meant to facilitate the translation of legal text into executable computer code, thanks to a syntax close to that of legal language and reasoning. Legal statutes paired with their Catala translation have been published online periodically, but manual translation remains labor-intensive. In this work, we develop a benchmark for the evaluation of Catala code generation from legal text, including a training set to fine-tune Large Language Models. To assess the quality of the generated code, we introduce an evaluation framework extending current metrics for code generation. Our experiments with few-shot learning, as well as fine-tuned models, suggest the feasibility of automating legal code generation, and contrast with prior attempts to translate legal language into a formal representation.

Beyond the Haystack: Sensitivity to Context in Legal Reference Recall
Eric Xia | Karthik Srikumar | Keshav Karthik | Advaith Renjith | Ashwinee Panda

Reference retrieval is critical for many applications in the legal domain, for instance in determining which case texts support a particular claim. However, existing benchmarking methods do not rigorously enable evaluation of recall capabilities in previously unseen contexts. We develop an evaluation framework from U.S. court opinions which ensures models have no prior knowledge of case results or context. Applying our framework, we identify a consistent gap across models and tasks between traditional needle-in-a-haystack retrieval and actual performance in legal recall. Our work shows that standard needle-in-a-haystack benchmarks consistently overestimate recall performance in the legal domain. By isolating the causes of performance degradation to contextual informativity rather than distributional differences, our findings highlight the need for specialized testing in reference-critical applications, and establish an evaluation framework for improving retrieval across informativity levels.

Machine Unlearning of Personally Identifiable Information in Large Language Models
Dan Parii | Thomas van Osch | Chang Sun

Pretrained LLMs are trained on massive web-scale datasets, which often contain personally identifiable information (PII), raising serious legal and ethical concerns. A key research challenge is how to effectively unlearn PII without degrading the model’s utility or leaving implicit knowledge that can be exploited. This study proposes UnlearnPII, a benchmark designed to evaluate the effectiveness of PII unlearning methods, addressing limitations in existing metrics that overlook implicit knowledge and assess all tokens equally. Our benchmark focuses on detecting PII leakage, testing model robustness through obfuscated prompts and jailbreak attacks over different domains, while measuring utility and retention quality. To advance practical solutions, we propose a new PII unlearning method, PERMUtok. By applying token-level noise, we achieve (1) simplified integration into existing workflows and (2) improved retention and output quality, while maintaining unlearning effectiveness. The code is open-source and publicly available.

Modeling Motivated Reasoning in Law: Evaluating Strategic Role Conditioning in LLM Summarization
Eunjung Cho | Alexander Miserlis Hoyle | Yoan Hermstrüwer

Large Language Models (LLMs) are increasingly used to generate user-tailored summaries, adapting outputs to specific stakeholders. In legal contexts, this raises important questions about motivated reasoning — how models strategically frame information to align with a stakeholder’s position within the legal system. Building on theories of legal realism and recent trends in legal practice, we investigate how LLMs respond to prompts conditioned on different legal roles (e.g., judges, prosecutors, attorneys) when summarizing judicial decisions. We introduce an evaluation framework grounded in legal fact and reasoning inclusion, also considering favorability towards stakeholders. Our results show that even when prompts include balancing instructions, models exhibit selective inclusion patterns that reflect role-consistent perspectives. These findings raise broader concerns about how similar alignment may emerge as LLMs begin to infer user roles from prior interactions or context, even without explicit role instructions. Our results underscore the need for role-aware evaluation of LLM summarization behavior in high-stakes legal settings.

Label-Free Distinctiveness: Building a Continuous Trademark Scale via Synthetic Anchors
Huihui Xu | Kevin D. Ashley

Trademark law protects distinctive marks that are able to identify and distinguish goods or services. The Abercrombie spectrum classifies marks from generic to fanciful based on distinctiveness. The Abercrombie spectrum employs hard buckets, while the real world of branding rarely falls into neat bins: marks often hover at the blurry border between “descriptive” and “suggestive”, for example. By requiring trademark examiners or researchers to pick one of the five buckets, one loses useful information where the lines get blurry. So hard boundaries obscure valuable gradations of meaning. In this work, we explore creating a continuous ruler of distinctiveness as a complementary diagnostic tool to the original buckets. The result is a label-free ladder, where every mark, real or synthetic, gets a real-valued score. These continuous scores reveal subtle distinctions among marks and provide interpretable visualizations that help practitioners understand where a mark falls relative to established anchors. Testing with 95 expert-classified trademark examples achieves a Spearman’s ρ = 0.718 and Pearson’s r = 0.724 against human labels, while offering intuitive visualizations on the continuous spectrum. A demo can be found at https://distinctiveness-ruler-demo.streamlit.app/.

Copyright Infringement by Large Language Models in the EU: Misalignment, Safeguards, and the Path Forward
Noah Scharrenberg | Chang Sun

This position paper argues that European copyright law has struggled to keep pace with the development of large language models (LLMs), possibly creating a fundamental epistemic misalignment: copyright compliance relies on qualitative, context-dependent standards, while LLM development is governed by quantitative, proactive metrics. This gap means that technical safeguards, by themselves, may be insufficient to reliably demonstrate legal compliance. We identify several practical limitations in the existing EU legal frameworks, including ambiguous “lawful access” rules, fragmented opt-outs, and vague disclosure duties. We then discuss technical measures such as provenance-first data governance, machine unlearning for post-hoc removal, and synthetic data generation, showing their promise but also their limits. Finally, we propose a path forward grounded in legal-technical co-design, suggesting directions for standardising machine-readable opt-outs and disclosure templates, clarifying core legal terms, and developing legally-informed benchmarks and evidence standards. We conclude that such an integrated framework is essential to make compliance auditable, thus protecting creators’ rights while enabling responsible AI innovation at scale.

Grounded Answers from Multi-Passage Regulations: Learning-to-Rank for Regulatory RAG
Tuba Gokhan | Ted Briscoe

Regulatory compliance questions often require aggregating evidence from multiple, interrelated sections of long, complex documents. To support question-answering (QA) in this setting, we introduce ObliQA-MP, a dataset for multi-passage regulatory QA, extending the earlier ObliQA benchmark (CITATION), and improve evidence quality with an LLM–based validation step that filters out ~20% of passages missed by prior natural language inference (NLI) based filtering. Our benchmarks show a notable performance drop from single- to multi-passage retrieval, underscoring the challenges of semantic overlap and structural complexity in regulatory texts. To address this, we propose a feature-based learning-to-rank (LTR) framework that integrates lexical, semantic, and graph-derived information, achieving consistent gains over dense and hybrid baselines. We further add a lightweight score-based filter to trim noisy tails and an obligation-centric prompting technique. On ObliQA-MP, LTR improves retrieval (Recall@10/MAP@10/nDCG@10) over dense, hybrid, and fusion baselines. Our generation approach, based on domain-specific filtering plus prompting, achieves strong scores using the RePAS metric (CITATION) on ObliQA-MP, producing faithful, citation-grounded answers. Together, ObliQA-MP and our validation and RAG systems offer a stronger benchmark and a practical recipe for grounded, citation-controlled QA in regulatory domains.
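
As an illustration of what a feature-based learning-to-rank step combining lexical, semantic, and graph-derived signals might look like, here is a small sketch; the specific features, the regressor, and the data layout are assumptions rather than the paper's LTR framework.

```python
# Sketch of feature-based learning-to-rank over regulatory passages:
# combine a lexical BM25 score, a dense semantic similarity, and a
# graph-derived signal, then fit a simple pointwise ranker.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
from sklearn.ensemble import GradientBoostingRegressor

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder

def features(query, passage, bm25, idx, graph_degree):
    q_emb = encoder.encode(query, normalize_embeddings=True)
    p_emb = encoder.encode(passage, normalize_embeddings=True)
    return [
        bm25.get_scores(query.split())[idx],  # lexical relevance
        float(q_emb @ p_emb),                 # semantic similarity
        graph_degree,                         # e.g. cross-reference count of the passage
    ]

def train_ltr(queries, passages, labels, graph_degrees):
    """labels[i][j] = 1 if passages[j] is relevant evidence for queries[i]."""
    bm25 = BM25Okapi([p.split() for p in passages])
    X, y = [], []
    for i, q in enumerate(queries):
        for j, p in enumerate(passages):
            X.append(features(q, p, bm25, j, graph_degrees[j]))
            y.append(labels[i][j])
    return GradientBoostingRegressor().fit(np.array(X), np.array(y)), bm25
```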

NyayGraph: A Knowledge Graph Enhanced Approach for Legal Statute Identification in Indian Law using Large Language Models
Siddharth Shukla | Tanuj Tyagi | Abhay Singh Bisht | Ashish Sharma | Basant Agarwal

One of the first steps in the judicial process is finding the applicable statutes/laws based on the facts of the current situation. Manually searching through multiple pieces of legislation to find the relevant statutes can be time-consuming, making the Legal Statute Identification (LSI) task important for reducing the workload and helping improve the efficiency of the judicial system. To address this gap, we present a novel knowledge graph-enhanced approach for Legal Statute Identification (LSI) in Indian legal documents using Large Language Models, incorporating structural relationships from the Indian Penal Code (IPC), the main legislation codifying criminal laws in India. On the IL-TUR benchmark, explicit KG inference significantly enhances recall without sacrificing competitive precision. Augmenting LLM prompts with KG context, though, merely enhances coverage at the expense of precision, underscoring the importance of good reranking techniques. This research provides the first complete IPC knowledge graph and shows that organized legal relations richly augment statute retrieval, subject to being integrated into language models in a judicious way. Our code and data are publicly available on GitHub (https://github.com/SiddharthShukla48/NyayGraph).
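
To give a flavour of knowledge-graph-based candidate expansion for statute identification, the sketch below retrieves an initial set of sections and adds structurally related neighbours from a graph. The example sections, edges, and relation types are illustrative and not the released IPC knowledge graph.

```python
# Sketch of KG-based candidate expansion for Legal Statute Identification:
# start from retrieved sections, then add graph neighbours to boost recall.
import networkx as nx

kg = nx.Graph()
kg.add_edge("IPC 378 Theft", "IPC 379 Punishment for theft", relation="punishment_of")
kg.add_edge("IPC 378 Theft", "IPC 390 Robbery", relation="aggravated_form")

def expand_candidates(initial_sections, hops=1):
    """Add neighbours within `hops` of each retrieved section."""
    expanded = set(initial_sections)
    frontier = set(initial_sections)
    for _ in range(hops):
        nxt = set()
        for s in frontier:
            if s in kg:
                nxt.update(kg.neighbors(s))
        expanded |= nxt
        frontier = nxt
    return sorted(expanded)

print(expand_candidates(["IPC 378 Theft"]))
```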

Evaluating LLM-Generated Legal Explanations for Regulatory Compliance in Social Media Influencer Marketing
Haoyang Gui | Thales Bertaglia | Taylor Annabell | Catalina Goanta | Tjomme Dooper | Gerasimos Spanakis

The rise of influencer marketing has blurred boundaries between organic content and sponsored content, making the enforcement of legal rules relating to transparency challenging. Effective regulation requires applying legal knowledge with a clear purpose and reason, yet current detection methods of undisclosed sponsored content generally lack legal grounding or operate as opaque “black boxes.” Using 1,143 Instagram posts, we compare gpt-5-nano and gemini-2.5-flash-lite under three prompting strategies with controlled levels of legal knowledge provided. Both models perform strongly in classifying content as sponsored or not (F1 up to 0.93), with Gemini favouring recall (0.93) and GPT favouring precision (0.95), though performance drops by over 10 points on ambiguous cases. We further develop a taxonomy of reasoning errors, showing frequent citation omissions (28.57%), unclear references (20.71%), and hidden ads exhibiting the highest miscue rate (28.57%). While adding regulatory text to the prompt improves explanation quality, it does not consistently improve detection accuracy. The contribution of this paper is threefold. First, it makes a novel addition to regulatory compliance technology by providing a taxonomy of common errors in LLM-generated legal reasoning to evaluate whether automated moderation is not only accurate but also legally robust, thereby advancing the transparent detection of influencer marketing content. Second, it features an original dataset of LLM explanations annotated by two students who were trained in influencer marketing law. Third, it combines quantitative and qualitative evaluation strategies for LLM explanations and critically reflects on how these findings can support advertising regulatory bodies in automating moderation processes on a solid legal foundation.

Validate Your Authority: Benchmarking LLMs on Multi-Label Precedent Treatment Classification
M. Mikail Demir | M Abdullah Canbaz

Automating the classification of negative treatment in legal precedent is a critical yet nuanced NLP task where misclassification carries significant risk. To address the shortcomings of standard accuracy, this paper introduces a more robust evaluation framework. We benchmark modern Large Language Models on a new, expert-annotated dataset of 239 real-world legal citations and propose a novel Average Severity Error metric to better measure the practical impact of classification errors. Our experiments reveal a performance split: Google’s Gemini 2.5 Flash achieved the highest accuracy on a high-level classification task (79.1%), while OpenAI’s GPT-5-mini was the top performer on the more complex fine-grained schema (67.7%). This work establishes a crucial baseline, provides a new context-rich dataset, and introduces an evaluation metric tailored to the demands of this complex legal reasoning task.
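
The abstract names an Average Severity Error metric without defining it; the snippet below shows one plausible formulation purely as an illustration: map each treatment label to a severity rank and average the absolute rank distance between prediction and gold. The label set and severity ranks are assumptions, not the paper's schema.

```python
# Illustrative (assumed) formulation of a severity-weighted error metric for
# precedent-treatment classification. Labels and ranks are placeholders.
SEVERITY = {"followed": 0, "distinguished": 1, "criticized": 2, "overruled": 3}

def average_severity_error(preds, golds):
    assert len(preds) == len(golds)
    errors = [abs(SEVERITY[p] - SEVERITY[g]) for p, g in zip(preds, golds)]
    return sum(errors) / len(errors)

# Confusing "overruled" with "followed" (error 3) is penalised more heavily
# than confusing two adjacent categories (error 1), unlike plain accuracy.
print(average_severity_error(["followed", "criticized"], ["overruled", "distinguished"]))
```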

Labor Lex: A New Portuguese Corpus and Pipeline for Information Extraction in Brazilian Legal Texts
Pedro Vitor Quinta de Castro | Nádia Félix Felipe Da Silva

Relation Extraction (RE) is a challenging Natural Language Processing task that involves identifying named entities from text and classifying the relationships between them. When applied to a specific domain, the task acquires a new layer of complexity, handling the lexicon and context particular to the domain in question. In this work, this task is applied to the Legal domain, specifically targeting Brazilian Labor Law. Architectures based on Deep Learning, with word representations derived from Transformer Language Models (LM), have shown state-of-the-art performance for the RE task. Recent works on this task handle Named Entity Recognition (NER) and RE either as a single joint model or as a pipelined approach. In this work, we introduce Labor Lex, a newly constructed corpus based on public documents from Brazilian Labor Courts. We also present a pipeline of models trained on it. Different experiments are conducted for each task, comparing supervised training using LMs and In-Context Learning (ICL) with Large Language Models (LLM), and verifying and analyzing the results for each one. For the NER task, the best achieved result was 89.97% F1-Score, and for the RE task, the best result was 82.38% F1-Score. The best results for both tasks were obtained using the supervised training approach.

Evaluating the Role of Verifiers in Test-Time Scaling for Legal Reasoning Tasks
Davide Romano | Jonathan Richard Schwarz | Daniele Giofrè

Test-time scaling (TTS) techniques can improve the performance of large language models (LLMs) at the expense of additional computation and latency. While TTS has proven effective in formal domains such as mathematics and programming (Snell et al., 2024; Chen et al., 2024), its value in argumentative domains such as law remains underexplored. We present an empirical study of verifier-based TTS methods for legal multiple-choice QA (MCQA) across five benchmarks. Using a family of 7 reward models, we evaluate both outcome-level (Best-of-N) and process-level (tree search) verification under realistic low-N budgets. Our analysis systematically investigates how verifier utility is affected by key properties such as domain specialization, model size, and supervision type (process-supervised PRMs vs. outcome-only ORMs), even when applied across different roles.
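
For readers new to verifier-based test-time scaling, the following sketch shows the outcome-level Best-of-N pattern the abstract refers to: sample several candidate answers and keep the one the reward model scores highest. `generate_candidates` and `score_with_rm` are placeholders for any policy LLM and outcome reward model, not the study's code.

```python
# Sketch of outcome-level Best-of-N verification for multiple-choice QA.
def generate_candidates(question: str, n: int) -> list[str]:
    raise NotImplementedError("sample n answers from the policy LLM")

def score_with_rm(question: str, answer: str) -> float:
    raise NotImplementedError("outcome reward model (ORM) score")

def best_of_n(question: str, n: int = 4) -> str:
    candidates = generate_candidates(question, n)
    scores = [score_with_rm(question, a) for a in candidates]
    return candidates[max(range(n), key=scores.__getitem__)]  # keep the best-scored answer
```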

Domain Adapted Text Summarization with Self-Generated Guidelines
Andrianos Michail | Bartosz Rudnikowicz | Pavlos Fragkogiannis | Cristina Kadar

Text summarization systems face significant adaptation costs when deployed across diverse domains, requiring expensive few-shot learning or manual prompt engineering. We propose a cost-effective domain adaptation framework that generates reusable summarization guidelines using only two reference summaries and three LLM inferences. Our approach works by having the model compare its own generated summaries against domain-specific reference summaries in a one-time preparation step that derives concise natural language guidelines capturing the summarization patterns of the target domain. These guidelines are then appended to the summarization prompt to adapt the LLM to the target domain at minimal cost. We evaluate our method across diverse model sizes on three distinct summarization domains: lawsuits, arXiv papers, and patents. Automatic metrics show that guideline-based adaptation achieves comparable or superior performance compared to in-context learning and zero-shot baselines. An LLM preference evaluation using the latest models shows that summaries generated with such guidelines are superior to those produced by zero-shot or in-context learning summarization prompts. Our method enables efficient domain adaptation of text summarizer LLMs with minimal resource overhead, making specialized summarization particularly accessible for agentic systems that need to process heterogeneous texts in enterprise environments.
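
The three-inference preparation step described in the abstract can be sketched roughly as follows; `call_llm` and the prompt wording are assumptions standing in for whatever client and prompts the authors used.

```python
# Sketch of the guideline-generation flow: summarize two reference documents,
# ask the model to contrast its drafts with the references, and reuse the
# resulting guidelines in later summarization prompts.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def derive_guidelines(doc_a, ref_a, doc_b, ref_b):
    draft_a = call_llm(f"Summarize the following document:\n{doc_a}")  # inference 1
    draft_b = call_llm(f"Summarize the following document:\n{doc_b}")  # inference 2
    return call_llm(                                                   # inference 3
        "Compare the draft summaries with the reference summaries and write "
        "concise, reusable guidelines describing how summaries in this domain "
        f"should be written.\nDraft A: {draft_a}\nReference A: {ref_a}\n"
        f"Draft B: {draft_b}\nReference B: {ref_b}"
    )

def summarize_with_guidelines(document, guidelines):
    return call_llm(f"Guidelines:\n{guidelines}\n\nSummarize:\n{document}")
```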

PILOT-Bench: A Benchmark for Legal Reasoning in the Patent Domain with IRAC-Aligned Classification Tasks
Yehoon Jang | Chaewon Lee | Hyun-seok Min | Sungchul Choi

The Patent Trial and Appeal Board (PTAB) of the USPTO adjudicates thousands of *ex parte* appeals each year, requiring the integration of technical understanding and legal reasoning. While large language models (LLMs) are increasingly applied in patent and legal practice, their use has remained limited to lightweight tasks, with no established means of systematically evaluating their capacity for structured legal reasoning in the patent domain. To address this gap, we introduce **PILOT-Bench** (**P**atent **I**nva**L**idati**O**n **T**rial Benchmark), a dataset and benchmark that aligns PTAB decisions with USPTO patent data at the case level and formalizes three IRAC-aligned classification tasks: Issue Type, Board Authorities, and Subdecision. We evaluate a diverse set of commercial and open-source LLMs and conduct analyses across multiple perspectives, including input-variation settings, model families, and error tendencies. Notably, on the Issue Type task, commercial models consistently exceed 0.75 in Exact Match, whereas the strongest open-source model (Qwen-8B) achieves performance around 0.56, highlighting the substantial gap in reasoning capabilities. PILOT-Bench establishes a foundation for the systematic evaluation of patent-domain legal reasoning and points toward future directions for improving LLMs through dataset design and model alignment. All data, code, and benchmark resources are available at https://github.com/TeamLab/pilot-bench.

Efficient Prompt Optimisation for Legal Text Classification with Proxy Prompt Evaluator
Hyunji Lee | Kevin Chenhao Li | Matthias Grabmair | Shanshan Xu

Prompt optimization aims to systematically refine prompts to enhance a language model’s performance on specific tasks. Fairness detection in Terms of Service (ToS) clauses is a challenging legal NLP task that demands carefully crafted prompts to ensure reliable results. However, existing prompt optimization methods are often computationally expensive due to inefficient search strategies and costly prompt candidate scoring. In this paper, we propose a framework that combines Monte Carlo Tree Search (MCTS) with a proxy prompt evaluator to more effectively explore the prompt space while reducing evaluation costs. Experiments demonstrate that our approach achieves higher classification accuracy and efficiency than baseline methods under a constrained computation budget.
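
The full MCTS search is beyond a short sketch, but the cost-saving idea of a proxy prompt evaluator can be illustrated as below: score candidate prompts cheaply on a small subsample and promote only the most promising ones to the full evaluation. The subsample size, promotion rule, and `accuracy_on` are assumptions, not the paper's implementation.

```python
# Sketch of proxy evaluation for prompt candidates: cheap subsample scoring
# first, full-budget scoring only for a shortlist.
import random

def accuracy_on(prompt: str, examples: list) -> float:
    raise NotImplementedError("run the classification prompt over the examples")

def proxy_then_full(candidate_prompts, dev_set, proxy_size=32, keep=3):
    proxy_set = random.sample(dev_set, min(proxy_size, len(dev_set)))
    proxy_scores = {p: accuracy_on(p, proxy_set) for p in candidate_prompts}
    shortlist = sorted(candidate_prompts, key=proxy_scores.get, reverse=True)[:keep]
    full_scores = {p: accuracy_on(p, dev_set) for p in shortlist}
    return max(full_scores, key=full_scores.get)
```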

ContractEval: Benchmarking LLMs for Clause-Level Legal Risk Identification in Commercial Contracts
Shuang Liu | Zelong Li | Ruoyun Ma | Haiyan Zhao | Mengnan Du

The potential of large language models (LLMs) in contract legal risk analysis remains underexplored. In response, this paper introduces ContractEval, the first benchmark to thoroughly evaluate whether open-source LLMs could match proprietary LLMs in identifying clause-level legal risks in commercial contracts. Using the Contract Understanding Atticus Dataset (CUAD), we assess 4 proprietary and 15 open-source LLMs. Our results highlight five key findings: (1) Proprietary models outperform open-source models in both correctness and output effectiveness. (2) Larger open-source models generally perform better, though the improvement slows down as models get bigger. (3) Reasoning (“thinking”) mode improves output effectiveness but reduces correctness, likely due to over-complicating simpler tasks. (4) Open-source models generate “no related clause” responses more frequently even when relevant clauses are present. (5) Model quantization speeds up inference but at the cost of a performance drop, showing the trade-off between efficiency and accuracy. These findings suggest that while most LLMs perform at a level comparable to junior legal assistants, open-source models require targeted fine-tuning to ensure correctness and effectiveness in high-stakes legal settings. ContractEval offers a solid benchmark to guide future development of legal-domain LLMs.

Contemporary LLMs struggle with extracting formal legal arguments
Lena Held | Ivan Habernal

Legal Argument Mining (LAM) is a complex challenge for humans and language models alike. This paper explores the application of Large Language Models (LLMs) in LAM, focusing on the identification of fine-grained argument types within judgment texts. We compare the performance of Flan-T5 and Llama 3 models against a baseline RoBERTa model to study whether the advantages of much larger LLMs can be leveraged for this task. Our study investigates the effectiveness of fine-tuning and prompting strategies in enhancing the models’ ability to discern nuanced argument types. Despite employing state-of-the-art techniques, our findings indicate that neither fine-tuning nor prompting could surpass the performance of a domain-pre-trained encoder-only model. This highlights the challenges and limitations in adapting general-purpose large language models to the specialized domain of legal argumentation. The insights gained from this research contribute to the ongoing discourse on optimizing NLP models for complex, domain-specific tasks. Our code and data for reproducibility are available at https://github.com/trusthlt/legal-argument-spans.

Aligning LLMs for Thai Legal Question Answering with Efficient Semantic-Similarity Rewards
Pawitsapak Akarajaradwong | Chompakorn Chaksangchaichot | Pirat Pothavorn | Ekapol Chuangsuwanich | Attapol Rutherford | Sarana Nutanong

The Retrieval-Augmented Generation (RAG) systems’ performance on Thai legal question answering is still limited, especially for questions requiring extensive, complex legal reasoning. To address these limitations, we introduce a resource-efficient approach that aligns Large Language Models (LLMs) for improved citation accuracy and response quality using Group-Relative Policy Optimization (GRPO). Our proposed method leverages BGE-M3 embeddings as a cost-efficient semantic-similarity reward, reducing computational expenses by up to 2.5x compared to an LLM-based reward model. Experiments on the NitiBench benchmark demonstrate substantial improvements: GRPO achieves up to 90% citation-F1 gains relative to the base model and a 31% increase in joint quality metrics over instruction tuning. Crucially, our approach provides a practical and effective solution for enhancing legal LLMs in resource-constrained environments.
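
The core idea of an embedding-based reward, replacing an LLM judge with cosine similarity between the generated answer and a reference, can be sketched as below. How this signal is shaped and combined with citation rewards inside GRPO is not shown here and would be an assumption; the loading path for the encoder is also an assumption (FlagEmbedding's BGEM3FlagModel could be swapped in).

```python
# Sketch of a semantic-similarity reward: cosine similarity between a
# generated answer and the reference answer, computed with a BGE-M3 encoder.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("BAAI/bge-m3")  # assumed loading path

def semantic_similarity_reward(generated: str, reference: str) -> float:
    emb = encoder.encode([generated, reference], normalize_embeddings=True)
    return float(emb[0] @ emb[1])  # cosine similarity in [-1, 1]
```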

Not ready for the bench: LLM legal interpretation is unstable and uncalibrated to human judgments
Abhishek Purushothama | Junghyun Min | Brandon Waldon | Nathan Schneider

Legal interpretation frequently involves assessing how a legal text, as understood by an ‘ordinary’ speaker of the language, applies to the set of facts characterizing a legal dispute. Recent scholarship has proposed that legal practitioners add large language models (LLMs) to their interpretive toolkit. This work offers an empirical argument against LLM-assisted interpretation as recently practiced by legal scholars and federal judges. Our investigation in English shows that models do not provide stable interpretive judgments and are susceptible to subtle variations in the prompt. While instruction tuning slightly improves model calibration to human judgments, even the best-calibrated LLMs remain weak predictors of human native speakers’ judgments.

LeMAJ (Legal LLM-as-a-Judge): Bridging Legal Reasoning and LLM Evaluation
Joseph Enguehard | Morgane Van Ermengem | Kate Atkinson | Sujeong Cha | Arijit Ghosh Chowdhury | Prashanth Kallur Ramaswamy | Jeremy Roghair | Hannah R Marlowe | Carina Suzana Negreanu | Kitty Boxall | Diana Mincu

Evaluating large language model (LLM) outputs in the legal domain presents unique challenges due to the complex and nuanced nature of legal analysis. Current evaluation approaches either depend on reference data, which is costly to produce, or use standardized assessment methods, both of which have significant limitations for legal applications. Although LLM-as-a-Judge has emerged as a promising evaluation technique, its reliability and effectiveness in legal contexts depend heavily on evaluation processes unique to the legal industry and how trustworthy the evaluation appears to the human legal expert. This is where existing evaluation methods currently fail and exhibit considerable variability. This paper aims to close the gap: a) we break down lengthy responses into “Legal Data Points” (LDPs) — self-contained units of information — and introduce a novel, reference-free evaluation methodology that reflects how lawyers evaluate legal answers; b) we demonstrate that our method outperforms a variety of baselines on both our proprietary dataset and an open-source dataset (LegalBench); c) we show how our method correlates more closely with human expert evaluations and helps improve inter-annotator agreement; and finally d) we open source our Legal Data Points for a subset of LegalBench used in our experiments, allowing the research community to replicate our results and advance research in this vital area of LLM evaluation on legal question-answering.

A Framework to Retrieve Relevant Laws for Will Execution
Md Asiful Islam | Alice Saebom Kwak | Derek Bambauer | Clayton T Morrison | Mihai Surdeanu

Wills must comply with jurisdiction-specific statutory provisions to be valid, but retrieving the relevant laws for execution, validation, and probate remains labor-intensive and error-prone. Prior legal information retrieval (LIR) research has addressed contracts, criminal law, and judicial decisions, but wills and probate law remain largely unexplored, with no prior work on retrieving statutes for will validity assessment. We propose a legal information retrieval framework that combines lexical and semantic retrieval in a hybrid pipeline with large language model (LLM) reasoning to retrieve the most relevant provisions for a will statement. Evaluations on annotated will-statement datasets from the U.S. states of Tennessee and Idaho using six LLMs show that our hybrid framework consistently outperforms zero-shot baselines. Notably, when paired with our hybrid retrieval pipeline, GPT-5-mini achieves the largest relative accuracy gains, improving by 41.09 points on the Tennessee test set and 48.68 points on the Idaho test set. We observed similarly strong improvements across all models and datasets.

CourtNav: Voice-Guided, Anchor-Accurate Navigation of Long Legal Documents in Courtrooms
Sai Khadloya | Kush Juvekar | Arghya Bhattacharya | Utkarsh Saxena

Judicial work depends on close reading of long records, charge sheets, pleadings, annexures, and orders, often spanning hundreds of pages. With limited staff support, exhaustive reading during hearings is impractical. We present CourtNav, a voice-guided, anchor-first navigator for legal PDFs that maps a judge’s spoken command (e.g., “go to paragraph 23”, “highlight the contradiction in the cross-examination”) directly to a highlighted paragraph in seconds. CourtNav transcribes the command, classifies intent with a grammar-first, LLM-backed router, retrieves over a layout-aware hybrid index, and auto-scrolls the viewer to the cited span while highlighting it and close alternates. By design, the interface shows only grounded passages, never free text, keeping evidence verifiable and auditable. This need is acute in India, where judgments and cross-examinations are notoriously long. In a pilot on representative charge sheets, pleadings, and orders, median time-to-relevance drops from 3–5 minutes (manual navigation) to 10–15 seconds; with quick visual verification included, 30–45 seconds. Under fixed time budgets, this navigation-first design increases the breadth of the record actually consulted while preserving control and transparency.
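
A grammar-first, LLM-backed router of the kind described can be illustrated with a tiny sketch: explicit commands are matched by rules, and only unmatched commands fall back to an LLM classifier. The grammar, intent names, and fallback are illustrative, not CourtNav's actual router.

```python
# Sketch of grammar-first intent routing with an LLM fallback.
import re

def classify_with_llm(command: str) -> dict:
    raise NotImplementedError("fallback LLM intent classifier")

GRAMMAR = [
    (re.compile(r"go to paragraph (\d+)", re.I), "goto_paragraph"),
    (re.compile(r"highlight (.+)", re.I), "highlight"),
]

def route(command: str) -> dict:
    for pattern, intent in GRAMMAR:
        m = pattern.search(command)
        if m:
            return {"intent": intent, "argument": m.group(1)}
    return classify_with_llm(command)  # pay LLM latency only when the grammar misses

print(route("go to paragraph 23"))  # {'intent': 'goto_paragraph', 'argument': '23'}
```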

Are LLMs Court-Ready? Evaluating Frontier Models on Indian Legal Reasoning
Kush Juvekar | Arghya Bhattacharya | Sai Khadloya | Utkarsh Saxena

Large language models (LLMs) are moving into legal workflows, yet we lack a jurisdiction-grounded way to gauge their basic competence in them. We use India’s public legal examinations as a transparent proxy. Our multi-year benchmark assembles objective screens from top national and state exams and evaluates open and frontier LLMs under real-world exam conditions. To probe beyond MCQs, we also include a lawyer-graded, paired-blinded study of long-form answers from the Supreme Court’s Advocate-on-Record exam. This is, to our knowledge, the first exam-grounded, India-specific yardstick for LLM court-readiness released with datasets and protocols. Our work shows that while frontier systems consistently clear historical cutoffs and often match or exceed recent top-scorer bands on objective exams, none surpasses the human topper on long-form reasoning. Grader notes converge on three reliability failure modes: procedural/format compliance, authority/citation discipline, and forum-appropriate voice/structure. These findings delineate where LLMs can assist (checks, cross-statute consistency, statute and precedent lookups) and where human leadership remains essential: forum-specific drafting and filing, procedural and relief strategy, reconciling authorities and exceptions, and ethical, accountable judgment.

LegalSim: Multi-Agent Simulation of Legal Systems for Discovering Procedural Exploits
Sanket Badhe

We present LegalSim, a modular multi-agent simulation of adversarial legal proceedings that explores how AI systems can exploit procedural weaknesses in codified rules. Plaintiff and defendant agents choose from a constrained action space (for example, discovery requests, motions, meet-and-confer, sanctions) governed by a JSON rules engine, while a stochastic judge model with calibrated grant rates, cost allocations, and sanction tendencies resolves outcomes. We compare four policies: PPO, a contextual bandit with an LLM, a direct LLM policy, and a hand-crafted heuristic. Instead of optimizing binary case outcomes, agents are trained and evaluated using effective win rate and a composite exploit score that combines opponent-cost inflation, calendar pressure, settlement pressure at low merit, and a rule-compliance margin. Across configurable regimes (e.g., bankruptcy stays, inter partes review, tax procedures) and heterogeneous judges, we observe emergent “exploit chains”, such as cost-inflating discovery sequences and calendar-pressure tactics that remain procedurally valid yet systemically harmful. Evaluation via cross-play and Bradley-Terry ratings shows that PPO wins most often, the bandit is the most consistently competitive across opponents, the LLM policy trails them, and the heuristic is weakest. The results are stable across judge settings, and the simulation reveals emergent exploit chains, motivating red-teaming of legal rule systems in addition to model-level testing.
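
A JSON rules engine constraining an action space, as the abstract describes, could look roughly like the sketch below; the actions, fields, and limits are illustrative, not LegalSim's actual rule set.

```python
# Sketch of a JSON-style rules engine that decides whether a proposed action
# is procedurally permitted given the agent's history this phase.
import json

RULES = json.loads("""
{
  "discovery_request": {"max_per_phase": 5, "opponent_cost": 3.0},
  "motion_to_compel":  {"requires_prior": "meet_and_confer", "opponent_cost": 2.0},
  "sanctions_motion":  {"max_per_phase": 1, "opponent_cost": 1.0}
}
""")

def is_legal(action: str, history: list[str]) -> bool:
    rule = RULES.get(action)
    if rule is None:
        return False
    if "max_per_phase" in rule and history.count(action) >= rule["max_per_phase"]:
        return False
    if "requires_prior" in rule and rule["requires_prior"] not in history:
        return False
    return True

print(is_legal("motion_to_compel", ["discovery_request"]))  # False: no prior meet-and-confer
print(is_legal("motion_to_compel", ["meet_and_confer"]))    # True
```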

Linking Transparency and Accountability: Analysing The Connection Between TikTok’s Terms of Service and Moderation Decisions
Leonard Eßer | Gerasimos Spanakis

The European Commission’s Digital Services Act (DSA) mandates that Very Large Online Platforms (VLOPs), like TikTok, provide Statements of Reason (SoRs) to justify their content moderation decisions in an attempt to enhance transparency and accountability for these platforms. However, we can often notice a gap between these automated decisions and the platform’s written policies. This leaves users unable to understand the specific rule they have violated. This paper addresses this gap by developing and evaluating a pipeline to link TikTok’s SoRs from the DSA transparency database to the most relevant clause from TikTok’s policy documents. We test multiple methods to perform the linking task and evaluate performance using a wide range of retrieval methods and metrics. We develop and deliver a gold-standard dataset where a team of legal research assistants annotated 100 SoRs based on four criteria: clarity, understanding, presence of unclear terms and level of detail, each rated on a 1–4 scale. In addition, a binary rating is assigned for redress clarity. Moreover, annotators determined the best link to the relevant TikTok policy clauses. Results show that both TikTok’s SoRs and policy clauses are often extremely broad, granting TikTok more freedom to decide how to apply the clauses, making it even less transparent for users. We also provide a demo that, for each SoR, provides a ranking of the most relevant clauses from TikTok’s written policies, a tool that can be useful for users, regulators and researchers to better understand content moderation decisions, assess compliance with transparency requirements, and support further analysis of platform accountability.

Risks and Limits of Automatic Consolidation of Statutes
Max Prior | Adrian Hof | Niklas Wais | Matthias Grabmair

As in many countries of the Civil Law tradition, consolidated versions of statutes - statutes with added amendments - are difficult to obtain reliably and promptly in Germany. This gap has prompted interest in using large language models (LLMs) to ‘synthesize’ current and historical versions from amendments. Our paper experiments with an LLM-based consolidation framework and a dataset of 908 amendment–law pairs drawn from 140 Federal Law Gazette documents across four major codes. While automated metrics show high textual similarity (93-99%) for single-step and multi-step amendment chains, exact matches were achieved for only 50.3% of single-step and 20.51% of multi-step consolidations; our expert assessment reveals that non-trivial errors persist and that even small divergences can carry legal significance. We therefore argue that any public or private deployment must treat outputs as drafts subject to rigorous human verification.
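
The gap between high textual similarity and exact match that the paper reports can be made concrete with a small check like the one below; difflib's ratio is used here only as a stand-in for the paper's similarity metric, and the example statute text is invented for illustration.

```python
# Sketch: an LLM-consolidated statute can be nearly identical to the official
# text yet not match exactly, and the small divergence may be legally significant.
import difflib

def compare_versions(generated: str, official: str):
    exact = generated.strip() == official.strip()
    similarity = difflib.SequenceMatcher(None, generated, official).ratio()
    return {"exact_match": exact, "similarity": round(similarity, 4)}

official = "The fine shall not exceed five thousand euros."
generated = "The fine shall not exceed fifty thousand euros."  # tiny edit, big legal difference
print(compare_versions(generated, official))
# High similarity with exact_match False: exactly the case that needs human review.
```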

GReX: A Graph Neural Network-Based Rerank-then-Expand Method for Detecting Conflicts Among Legal Articles in Korean Criminal Law
Seonho An | Young-Yik Rhim | Min-Soo Kim

As social systems become more complex, legal articles have grown increasingly intricate, making it harder for humans to identify potential conflicts among them, particularly when drafting new laws or applying existing ones. Despite its importance, no method has been proposed to detect such conflicts. We introduce a new legal NLP task, Legal Article Conflict Detection (LACD), which aims to identify conflicting articles within a given body of law. To address this task, we propose GReX, a novel graph neural network-based retrieval method. Experimental results show that GReX significantly outperforms existing methods, achieving improvements of 44.8% in nDCG@50, 32.8% in Recall@50, and 39.8% in Retrieval F1@50. Our code is available at github.com/asmath472/LACD-public.

GuRE:Generative Query REwriter for Legal Passage Retrieval
Daehui Kim | Deokhyung Kang | Jonghwi Kim | Sangwon Ryu | Gary Lee

Legal Passage Retrieval (LPR) systems are crucial as they help practitioners save time when drafting legal arguments. However, it remains an underexplored avenue. One primary reason is the significant vocabulary mismatch between the query and the target passage. To address this, we propose a simple yet effective method, the Generative query REwriter (GuRE). We leverage the generative capabilities of Large Language Models (LLMs) by training the LLM for query rewriting. "Rewritten queries" help retrievers to retrieve target passages by mitigating vocabulary mismatch. Experimental results show that GuRE significantly improves performance in a retriever-agnostic manner, outperforming all baseline methods. Further analysis reveals that different training objectives lead to distinct retrieval behaviors, making GuRE more suitable than direct retriever fine-tuning for real-world applications. Code is available at github.com/daehuikim/GuRE.
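
The rewrite-then-retrieve pattern the abstract describes can be sketched as below; `rewrite` is an untrained placeholder for the rewriter LLM and BM25 stands in for whatever retriever is used, so both are assumptions rather than GuRE itself.

```python
# Sketch of query rewriting before retrieval to bridge the vocabulary gap
# between a user query and statute-like passage wording.
from rank_bm25 import BM25Okapi

def rewrite(query: str) -> str:
    raise NotImplementedError("LLM trained or prompted to restate the query in passage-like wording")

def retrieve(query: str, passages: list[str], k: int = 5):
    bm25 = BM25Okapi([p.split() for p in passages])
    rewritten = rewrite(query)                 # mitigate vocabulary mismatch
    scores = bm25.get_scores(rewritten.split())
    top = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)[:k]
    return [passages[i] for i in top]
```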

Extract-Explain-Abstract: A Rhetorical Role-Driven Domain-Specific Summarisation Framework for Indian Legal Documents
Veer Chheda | Aaditya Uday Ghaisas | Avantika Sankhe | Dr. Narendra Shekokar

Legal documents are characterized by their length, intricacy, and dense use of jargon, making efficacious summarisation both paramount and challenging. Existing zero-shot methodologies in small language models struggle to simplify this jargon and are prone to punts and hallucinations with longer prompts. This paper introduces the Rhetorical Role-based Extract-Explain-Abstract (EEA) Framework, a novel three-stage methodology for summarisation of Indian legal documents in low-resource settings. The approach begins by segmenting legal texts using rhetorical roles, such as facts, issues and arguments, through a domain-specific phrase corpus and extraction based on TF-IDF. In the explanation stage, the segmented output is enriched with logical connections to ensure coherence and legal fidelity. The final abstraction phase condenses these interlinked segments into cogent, high-level summaries that preserve critical legal reasoning. Experiments on Indian legal datasets show that the EEA framework typically outperforms competing approaches on ROUGE, BERTScore, Flesch Reading Ease, Age of Acquisition, SummaC and human evaluations. We also employ InLegalBERTScore as a metric to capture domain-specific semantics of Indian legal documents.
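
The TF-IDF extraction stage of such a pipeline can be illustrated with a small sketch: score the sentences inside each rhetorical-role segment and keep the most salient ones for the later explain and abstract steps. The role segmentation, salience score, and number of kept sentences are assumptions, not the paper's implementation.

```python
# Sketch of the extraction stage of an Extract-Explain-Abstract pipeline:
# pick the highest TF-IDF-weighted sentences within each rhetorical-role segment.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def extract_salient(role_segments: dict[str, list[str]], top_k: int = 3):
    """role_segments maps a rhetorical role (facts, issues, arguments, ...) to its sentences."""
    extracted = {}
    for role, sentences in role_segments.items():
        if not sentences:
            continue
        tfidf = TfidfVectorizer().fit_transform(sentences)   # sentence x term matrix
        scores = np.asarray(tfidf.sum(axis=1)).ravel()       # salience = summed TF-IDF weight
        keep = np.argsort(-scores)[:top_k]
        extracted[role] = [sentences[i] for i in sorted(keep)]  # preserve original order
    return extracted
```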