Memorization in Language Models through the Lens of Intrinsic Dimension

Stefan Arnold


Abstract
Language Models (LMs) are prone to memorizing parts of their training data and unintentionally emitting them at generation time, raising concerns about privacy leakage and disclosure of intellectual property. While previous research has identified properties such as context length, parameter size, and duplication frequency as key drivers of unintended memorization, little is known about how latent structure modulates the rate of memorization. We investigate the role of Intrinsic Dimension (ID), a geometric proxy for the structural complexity of a sequence in latent space, in modulating memorization. Our findings suggest that ID acts as a suppressive signal for memorization: compared to low-ID sequences, high-ID sequences are less likely to be memorized, particularly in overparameterized models and under sparse exposure. These findings highlight the interaction between scale, exposure, and complexity in shaping memorization.
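The abstract does not state which ID estimator the paper uses; a minimal sketch of one common choice, the TwoNN estimator (which infers dimension from the ratio of each point's second- to first-nearest-neighbor distance), is shown below. The function name `two_nn_id` and the maximum-likelihood form are illustrative assumptions, not the authors' exact method.

```python
import numpy as np

def two_nn_id(X: np.ndarray) -> float:
    """Estimate the intrinsic dimension of a point cloud X (n_points, n_features)
    via the TwoNN ratio mu = r2 / r1, using the MLE form d = N / sum(log mu)."""
    # Pairwise Euclidean distances; mask the diagonal so a point is not
    # counted as its own nearest neighbor.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    nearest = np.sort(D, axis=1)          # row-wise sorted neighbor distances
    mu = nearest[:, 1] / nearest[:, 0]    # second- over first-neighbor distance
    return len(mu) / np.sum(np.log(mu))   # maximum-likelihood estimate of d

# Sanity check: points on a 2-D plane linearly embedded in 10-D
# should have an estimated ID close to 2.
rng = np.random.default_rng(0)
Z = rng.uniform(size=(500, 2))
X = Z @ rng.normal(size=(2, 10))
print(two_nn_id(X))
```

In the paper's setting, `X` would be the latent representations of a token sequence, and the resulting scalar would be compared against that sequence's measured memorization rate.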
Anthology ID:
2025.l2m2-1.2
Volume:
Proceedings of the First Workshop on Large Language Model Memorization (L2M2)
Month:
August
Year:
2025
Address:
Vienna, Austria
Editors:
Robin Jia, Eric Wallace, Yangsibo Huang, Tiago Pimentel, Pratyush Maini, Verna Dankers, Johnny Wei, Pietro Lesci
Venues:
L2M2 | WS
Publisher:
Association for Computational Linguistics
Pages:
23–28
URL:
https://preview.aclanthology.org/landing_page/2025.l2m2-1.2/
DOI:
10.18653/v1/2025.l2m2-1.2
Cite (ACL):
Stefan Arnold. 2025. Memorization in Language Models through the Lens of Intrinsic Dimension. In Proceedings of the First Workshop on Large Language Model Memorization (L2M2), pages 23–28, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Memorization in Language Models through the Lens of Intrinsic Dimension (Arnold, L2M2 2025)
PDF:
https://preview.aclanthology.org/landing_page/2025.l2m2-1.2.pdf