A Comprehensive Analysis of Memorization in Large Language Models
Hirokazu Kiyomaru, Issa Sugiura, Daisuke Kawahara, Sadao Kurohashi
Abstract
This paper presents a comprehensive study that investigates memorization in large language models (LLMs) from multiple perspectives. Experiments are conducted with the Pythia and LLM-jp model suites, both of which offer LLMs with over 10B parameters and full access to their pre-training corpora. Our findings include: (1) memorization is more likely to occur with larger model sizes, longer prompt lengths, and frequent texts, which aligns with findings in previous studies; (2) memorization is less likely to occur for texts not trained during the latter stages of training, even if they frequently appear in the training corpus; (3) the standard methodology for judging memorization can yield false positives, and texts that are infrequent yet flagged as memorized typically result from causes other than true memorization.
- Anthology ID:
- 2024.inlg-main.45
- Volume:
- Proceedings of the 17th International Natural Language Generation Conference
- Month:
- September
- Year:
- 2024
- Address:
- Tokyo, Japan
- Editors:
- Saad Mahamood, Nguyen Le Minh, Daphne Ippolito
- Venue:
- INLG
- SIG:
- SIGGEN
- Publisher:
- Association for Computational Linguistics
- Pages:
- 584–596
- URL:
- https://preview.aclanthology.org/add-emnlp-2024-awards/2024.inlg-main.45/
- Cite (ACL):
- Hirokazu Kiyomaru, Issa Sugiura, Daisuke Kawahara, and Sadao Kurohashi. 2024. A Comprehensive Analysis of Memorization in Large Language Models. In Proceedings of the 17th International Natural Language Generation Conference, pages 584–596, Tokyo, Japan. Association for Computational Linguistics.
- Cite (Informal):
- A Comprehensive Analysis of Memorization in Large Language Models (Kiyomaru et al., INLG 2024)
- PDF:
- https://preview.aclanthology.org/add-emnlp-2024-awards/2024.inlg-main.45.pdf
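For context, the "standard methodology for judging memorization" mentioned in point (3) of the abstract is the usual extractability check: prompt the model with the first k tokens of a training example and test whether greedy decoding reproduces the following tokens verbatim. Below is a minimal sketch of that check, assuming a Hugging Face causal LM; the checkpoint name and the prompt/continuation lengths are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of a verbatim-memorization check (illustrative, not the paper's code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/pythia-1b"  # any Pythia checkpoint; size chosen for illustration
PROMPT_LEN = 32        # number of prompt tokens (k); the paper varies this
CONTINUATION_LEN = 32  # number of continuation tokens to verify

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def is_memorized(text: str) -> bool:
    """Return True if greedy decoding reproduces the continuation verbatim."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    if ids.size(0) < PROMPT_LEN + CONTINUATION_LEN:
        return False  # example too short to test
    prompt = ids[:PROMPT_LEN].unsqueeze(0)
    reference = ids[PROMPT_LEN:PROMPT_LEN + CONTINUATION_LEN]
    with torch.no_grad():
        generated = model.generate(
            prompt,
            max_new_tokens=CONTINUATION_LEN,
            do_sample=False,  # greedy decoding, as in the standard definition
        )
    continuation = generated[0, PROMPT_LEN:PROMPT_LEN + CONTINUATION_LEN]
    return torch.equal(continuation, reference)
```

An exact-match criterion like this is what can produce the false positives discussed in the paper: short, formulaic, or highly predictable continuations may be reproduced without having been memorized from frequent training occurrences.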