Odysseas S. Chlapanis


2025

GreekBarBench: A Challenging Benchmark for Free-Text Legal Reasoning and Citations
Odysseas S. Chlapanis | Dimitris Galanis | Nikolaos Aletras | Ion Androutsopoulos
Findings of the Association for Computational Linguistics: EMNLP 2025

We introduce GreekBarBench, a benchmark that evaluates LLMs on legal questions across five different legal areas from the Greek Bar exams, requiring citations to statutory articles and case facts. To tackle the challenges of free-text evaluation, we propose a three-dimensional scoring system combined with an LLM-as-a-judge approach. We also develop a meta-evaluation benchmark to assess the correlation between LLM-judge scores and human expert evaluations, revealing that simple, span-based rubrics improve their alignment. Our extensive evaluation of 13 proprietary and open-weight LLMs shows that even though the top models exhibit impressive performance, they remain susceptible to critical errors, most notably a failure to identify the correct statutory articles.
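
As a rough illustration of the three-dimensional, span-based LLM-as-a-judge scoring described in the abstract, the Python sketch below shows one way such a scorer could be wired up. The dimension names, prompt format, and 0-10 scale are illustrative assumptions, not the paper's actual rubric, and the LLM call itself is left out.

```python
# Minimal sketch of a three-dimensional LLM-as-a-judge scorer with a
# span-based rubric. All names and the prompt format are illustrative
# assumptions, not the paper's actual implementation.

from dataclasses import dataclass

DIMENSIONS = ("facts", "articles", "analysis")  # hypothetical dimension names

@dataclass
class RubricSpan:
    text: str       # span of the reference answer the judge should look for
    dimension: str  # scoring dimension the span belongs to

def build_judge_prompt(question: str, answer: str, spans: list[RubricSpan]) -> str:
    """Assemble a judge prompt that asks for a 0-10 score per dimension,
    anchored on the rubric spans rather than a free-form comparison."""
    rubric = "\n".join(f"- [{s.dimension}] {s.text}" for s in spans)
    return (
        f"Question:\n{question}\n\nCandidate answer:\n{answer}\n\n"
        f"Score each dimension from 0 to 10, checking the answer against "
        f"these reference spans:\n{rubric}\n\n"
        f"Reply as: facts=<n> articles=<n> analysis=<n>"
    )

def parse_scores(judge_reply: str) -> dict[str, float]:
    """Parse a reply like 'facts=7 articles=4 analysis=8' into a dict."""
    scores = {}
    for token in judge_reply.split():
        if "=" in token:
            dim, value = token.split("=", 1)
            if dim in DIMENSIONS:
                scores[dim] = float(value)
    return scores

def overall_score(scores: dict[str, float]) -> float:
    """Average the three dimensions into a single 0-10 score."""
    return sum(scores.get(d, 0.0) for d in DIMENSIONS) / len(DIMENSIONS)
```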

AUEB-Archimedes at RIRAG-2025: Is Obligation concatenation really all you need?
Ioannis Chasandras | Odysseas S. Chlapanis | Ion Androutsopoulos
Proceedings of the 1st Regulatory NLP Workshop (RegNLP 2025)

This paper presents the systems we developed for RIRAG-2025, a shared task that requires answering regulatory questions by retrieving relevant passages. The generated answers are evaluated with RePASs, a reference-free, model-based metric. Our systems combine three retrieval models with a reranker. We show that by exploiting a neural component of RePASs that extracts important sentences (‘obligations’) from the retrieved passages, we achieve a dubiously high score (0.947), even though the answers are extracted verbatim from the retrieved passages rather than actually generated. We then show that by selecting the answer with the best RePASs score among a few generated alternatives, and then iteratively refining that answer to reduce contradictions and cover more obligations, we can produce readable, coherent answers that achieve a more plausible, still relatively high score (0.639).
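
The selection-and-refinement strategy described above can be sketched as a simple scoring loop: sample a few candidate answers, keep the one with the highest RePASs score, then accept a revision only when it improves that score. The function names below are hypothetical stand-ins for the LLM and the RePASs metric, not the actual system.

```python
# Sketch of best-of-n selection followed by score-guided iterative
# refinement. `generate_candidates`, `revise`, and `repass_score` are
# stubs standing in for an LLM and the RePASs metric; all names are
# assumptions, not the paper's implementation.

from typing import Callable, Sequence

def select_and_refine(
    question: str,
    passages: Sequence[str],
    generate_candidates: Callable[[str, Sequence[str]], list[str]],
    revise: Callable[[str, str, Sequence[str]], str],
    repass_score: Callable[[str, Sequence[str]], float],
    max_steps: int = 3,
) -> str:
    # 1) Best-of-n selection: keep the candidate with the highest RePASs.
    candidates = generate_candidates(question, passages)
    best = max(candidates, key=lambda a: repass_score(a, passages))
    best_score = repass_score(best, passages)

    # 2) Iterative refinement: ask the model to reduce contradictions and
    #    cover more obligations; accept a revision only if RePASs improves.
    for _ in range(max_steps):
        candidate = revise(question, best, passages)
        score = repass_score(candidate, passages)
        if score > best_score:
            best, best_score = candidate, score
    return best
```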

2024

LAR-ECHR: A New Legal Argument Reasoning Task and Dataset for Cases of the European Court of Human Rights
Odysseas S. Chlapanis | Dimitrios Galanis | Ion Androutsopoulos
Proceedings of the Natural Legal Language Processing Workshop 2024

We present Legal Argument Reasoning (LAR), a novel task designed to evaluate the legal reasoning capabilities of Large Language Models (LLMs). The task requires selecting the correct next statement (from multiple-choice options) in a chain of legal arguments from court proceedings, given the facts of the case. We constructed a dataset (LAR-ECHR) for this task using cases from the European Court of Human Rights (ECHR). We evaluated seven general-purpose LLMs on LAR-ECHR and found that (a) the ranking of the models is aligned with that of LegalBench, an established US-based legal reasoning benchmark, even though LAR-ECHR is based on European human rights law, (b) LAR-ECHR distinguishes the top models more clearly than LegalBench, and (c) even the best model (GPT-4o) obtains only 75.8% accuracy on LAR-ECHR, indicating significant potential for further model improvement. The process followed to construct LAR-ECHR can be replicated with cases from other legal systems.
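
Since LAR is a multiple-choice task, evaluation reduces to exact-match accuracy over the selected options. The sketch below shows a minimal harness under an assumed item format; the field names and the `ask_model` stub are illustrative, not the released dataset schema.

```python
# Minimal sketch of evaluating an LLM on a LAR-style multiple-choice item:
# given the case facts and the argument chain so far, the model must pick
# the correct next statement. The item fields and `ask_model` stub are
# illustrative assumptions, not the released dataset format.

from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class LARItem:
    facts: str                 # facts of the case
    argument_chain: list[str]  # statements in the argument so far
    options: list[str]         # candidate next statements (multiple choice)
    answer_idx: int            # index of the correct option

def accuracy(items: Sequence[LARItem], ask_model: Callable[[LARItem], int]) -> float:
    """Fraction of items where the model picks the correct next statement."""
    correct = sum(1 for item in items if ask_model(item) == item.answer_idx)
    return correct / len(items)
```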