David Stein
2025
Building a Long Text Privacy Policy Corpus with Multi-Class Labels
Florencia Marotta-Wurgler
|
David Stein
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Legal text poses distinctive challenges for natural language processing. The legal import of a term may depend on omissions, cross-references, or silence, Further, legal text is often susceptible to multiple valid, conflicting interpretations; as the saying goes: a good lawyer’s answer to any question is “it depends.”This work introduces a new, hand-coded dataset for the interpretation of privacy policies. It includes privacy policies from 149 firms, including materials incorporated by reference. The policies are annotated across 64 dimension that reflect the applicable legal rules and contested terms from EU and US privacy regulation and litigation. Our annotation methodology is designed to capture the capture core challenges peculiar to legal language, including indeterminacy, interdependence between clauses, meaningful silence, and the implications of legal defaults. We present a set of baseline results for the dataset using current large language models.