Evaluating the Role of Verifiers in Test-Time Scaling for Legal Reasoning Tasks

Davide Romano, Jonathan Richard Schwarz, Daniele Giofrè


Abstract
Test-time scaling (TTS) techniques can improve the performance of large language models (LLMs) at the expense of additional computation and latency. While TTS has proven effective in formal domains such as mathematics and programming (Snell et al., 2024; Chen et al., 2024), its value in argumentative domains such as law remains underexplored. We present an empirical study of verifier-based TTS methods for legal multiple-choice QA (MCQA) across five benchmarks. Using a family of 7 reward models, we evaluate both outcome-level (Best-of-N) and process-level (tree search) verification under realistic low-N budgets. Our analysis systematically investigates how verifier utility is affected by key properties such as domain specialization, model size, and supervision type (process-supervised PRMs vs. outcome-only ORMs), even when applied across different roles.
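The outcome-level (Best-of-N) verification the abstract describes can be sketched in a few lines: sample N candidate answers from a generator, score each with a reward model, and return the highest-scoring one. The sketch below is a minimal illustration under assumed interfaces, not the paper's implementation; the `generate` and `score` callables are hypothetical stand-ins for an LLM sampler and a verifier.

```python
# Minimal Best-of-N (outcome-level) verification sketch.
# `generate` and `score` are hypothetical stand-ins for an LLM sampler
# and an outcome reward model (ORM); they are NOT the paper's code.
import itertools
from typing import Callable, List


def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 4) -> str:
    """Sample n candidate answers and return the verifier's top pick."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))


# Toy usage with stub models: the "generator" cycles through MCQA options,
# and the "verifier" assigns reward 1.0 to option "B" and 0.0 otherwise.
options = itertools.cycle(["A", "B", "C", "D"])
answer = best_of_n("Which holding applies?",
                   generate=lambda p: next(options),
                   score=lambda p, c: 1.0 if c == "B" else 0.0)
print(answer)  # "B": the verifier selects it once it appears among the samples
```

Under the low-N budgets the paper studies, the whole method reduces to this argmax over verifier scores; process-level (tree search) verification differs by scoring intermediate reasoning steps rather than only final answers.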
Anthology ID:
2025.nllp-1.15
Volume:
Proceedings of the Natural Legal Language Processing Workshop 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Nikolaos Aletras, Ilias Chalkidis, Leslie Barrett, Cătălina Goanță, Daniel Preoțiuc-Pietro, Gerasimos Spanakis
Venues:
NLLP | WS
Publisher:
Association for Computational Linguistics
Pages:
207–225
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.nllp-1.15/
Cite (ACL):
Davide Romano, Jonathan Richard Schwarz, and Daniele Giofrè. 2025. Evaluating the Role of Verifiers in Test-Time Scaling for Legal Reasoning Tasks. In Proceedings of the Natural Legal Language Processing Workshop 2025, pages 207–225, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Evaluating the Role of Verifiers in Test-Time Scaling for Legal Reasoning Tasks (Romano et al., NLLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.nllp-1.15.pdf