@inproceedings{tseng-etal-2025-lawtoken,
title = "{L}aw{T}oken: a single token worth more than its constituents",
author = "Tseng, Yu-Hsiang and
Chou, Hsin-Yu and
Hsieh, Shu-Kai",
editor = "Boleda, Gemma and
Roth, Michael",
booktitle = "Proceedings of the 29th Conference on Computational Natural Language Learning",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/acl25-workshop-ingestion/2025.conll-1.3/",
pages = "30--46",
ISBN = "979-8-89176-271-8",
abstract = "Legal citations require correctly recalling the law references of complex law article names and article numbering, which large language models typically treat as multi-token sequences. Motivated by the form-meaning pair of constructionist approaches, we explore treating these multi-token law references as a single holistic law token and examining the implications for legal citation accuracy and differences in model interpretability. We train and compare two types of models: LawToken models, which encode the legal citations as a single law token, and LawBase models, which treat them as multi-token compounds. The results show that LawToken models outperform LawBase models on legal citation tasks, primarily due to fewer errors in the article numbering components. Further model representation analysis reveals that, while both models achieve comparable semantic representation quality, the multi-token-based LawBase suffers from degraded representations in multistep decoding, leading to more errors. Taken together, these findings suggest that form-meaning pairing can operate in a larger context, and this larger unit may offer advantages in future modeling of legal reasoning. In practice, this approach can significantly reduce the likelihood of hallucinations by anchoring legal citations as discrete, holistic tokens, thereby minimizing the risk of generating nonexistent or incorrect legal references."
}
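
For readers curious how the single-token encoding described in the abstract might look in practice, the sketch below is a minimal illustration, not the authors' code: it assumes a Hugging Face `transformers` setup, and the base model name and the `<LAW:...>` token strings are placeholders. It shows the general pattern of registering whole law references as holistic vocabulary items so each citation is emitted in one decoding step rather than as a multi-token span.

```python
# Minimal sketch (illustrative only) of a "LawToken"-style setup:
# treat an entire law reference as a single vocabulary item.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "gpt2"  # placeholder base model; the paper's models may differ
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical inventory of law references to encode as single tokens.
law_tokens = [
    "<LAW:CivilCode-184>",
    "<LAW:CriminalCode-271>",
]

# Add them as new vocabulary items and grow the embedding matrix to match.
tokenizer.add_tokens(law_tokens)
model.resize_token_embeddings(len(tokenizer))

# Each reference now maps to a single id instead of a multi-token sequence.
ids = tokenizer("<LAW:CivilCode-184>", add_special_tokens=False)["input_ids"]
assert len(ids) == 1
```

The new embeddings would then be learned during training on citation-bearing text; the contrast studied in the paper is between models trained with such holistic law tokens and models that keep citations as ordinary multi-token compounds.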