Maria Bandulevich


2026

Frequent revisions of complex regulatory documents in large organizations often introduce inconsistencies and contradictions that are difficult for lawyers and auditors to detect manually. Existing tools rely on character-level diffs and therefore miss paraphrases and semantic shifts. We introduce LegDiff, a novel benchmark for evaluating span-aware semantic comparison of legal texts, and use it to investigate whether large language models can detect semantic changes beyond token- and character-level matching. LegDiff comprises manually annotated pairs of legal paragraphs drawn from different documents. We also present a pipeline for generating synthetic training data that is consistent with the manual annotations and mirrors the structure and label distribution of the benchmark, as well as a visualization tool that displays the detected differences and inconsistencies. The dataset, code, and visualization tool are publicly available to facilitate reproducibility and further research (https://github.com/s-nlp/SLeDoC).
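To illustrate the limitation of character-level diffs noted above, the following minimal sketch scores two revisions of a clause with Python's standard `difflib`. It is not part of LegDiff or its pipeline: the three clauses are invented for illustration, `char_similarity` is a hypothetical helper, and `difflib` merely stands in for the class of character-level tools the abstract refers to.

```python
from difflib import SequenceMatcher


def char_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], based on difflib's matching blocks."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


# Invented example clauses (assumptions for illustration, not LegDiff data).
ORIGINAL = "The tenant shall pay rent by the first day of each month."
# Same obligation, different wording.
PARAPHRASE = "Rent is due from the lessee no later than the 1st of every month."
# One inserted word reverses the obligation.
CONTRADICTION = "The tenant shall not pay rent by the first day of each month."

if __name__ == "__main__":
    for label, text in [("paraphrase", PARAPHRASE), ("contradiction", CONTRADICTION)]:
        print(f"{label:>13}: {char_similarity(ORIGINAL, text):.2f}")
```

Under this measure the contradiction scores as more similar to the original than the paraphrase does, even though only the contradiction changes the meaning. A character-level diff would therefore flag the harmless rewording as a large change and the reversed obligation as a small one, which is the gap that span-aware semantic comparison is meant to close.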