Quantifying Memorization in Continual Pre-training with Japanese General or Industry-Specific Corpora

Hiromu Takahashi, Shotaro Ishihara


Abstract
Despite growing concern about memorization of training data by large language models (LLMs), there has been insufficient analysis under conditions involving non-English or industry-specific corpora. This study focuses on continual pre-training, a common approach for building non-English LLMs, and quantifies memorization of the training data. Specifically, we trained two models based on Llama 3 using Japanese Wikipedia (general) and Japanese financial news articles (industry-specific). Experiments showed a tendency for the amount of memorization to increase as training progressed, similar to empirical findings for English. This trend was clear in the industry-specific corpus, suggesting potential risks when using valuable, non-general industry corpora. We also identified issues specific to Japanese, and emphasized the importance of analysis in languages other than English.
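For readers unfamiliar with how memorization is typically quantified, below is a minimal sketch of the common prefix-continuation (extractable memorization) check: prompt the model with the first tokens of a training example and test whether greedy decoding reproduces the following tokens verbatim. The checkpoint name, prefix/suffix lengths, and document loader are illustrative assumptions, not necessarily the authors' exact metric or setup.

```python
# Sketch of a prefix-continuation memorization check (assumed setup, not the paper's exact protocol).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Meta-Llama-3-8B"  # assumption: any (continually pre-trained) causal LM checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

def is_memorized(text: str, prefix_len: int = 50, suffix_len: int = 50) -> bool:
    """Return True if greedy decoding of the prefix reproduces the true suffix exactly."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    if ids.size(0) < prefix_len + suffix_len:
        return False  # example too short to evaluate
    prefix = ids[:prefix_len].unsqueeze(0).to(model.device)
    true_suffix = ids[prefix_len:prefix_len + suffix_len]
    with torch.no_grad():
        out = model.generate(
            prefix,
            max_new_tokens=suffix_len,
            do_sample=False,  # greedy decoding
            pad_token_id=tokenizer.eos_token_id,
        )
    generated_suffix = out[0, prefix_len:prefix_len + suffix_len].cpu()
    return torch.equal(generated_suffix, true_suffix)

# Usage (hypothetical loader): memorization rate of a checkpoint over training documents.
# docs = load_training_documents(...)
# rate = sum(is_memorized(d) for d in docs) / len(docs)
```

Running this check at successive training checkpoints gives a memorization-over-time curve of the kind the abstract describes.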
Anthology ID: 2025.l2m2-1.8
Volume: Proceedings of the First Workshop on Large Language Model Memorization (L2M2)
Month: August
Year: 2025
Address: Vienna, Austria
Editors: Robin Jia, Eric Wallace, Yangsibo Huang, Tiago Pimentel, Pratyush Maini, Verna Dankers, Johnny Wei, Pietro Lesci
Venues: L2M2 | WS
Publisher: Association for Computational Linguistics
Pages: 95–105
URL: https://preview.aclanthology.org/landing_page/2025.l2m2-1.8/
Cite (ACL): Hiromu Takahashi and Shotaro Ishihara. 2025. Quantifying Memorization in Continual Pre-training with Japanese General or Industry-Specific Corpora. In Proceedings of the First Workshop on Large Language Model Memorization (L2M2), pages 95–105, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Quantifying Memorization in Continual Pre-training with Japanese General or Industry-Specific Corpora (Takahashi & Ishihara, L2M2 2025)
PDF: https://preview.aclanthology.org/landing_page/2025.l2m2-1.8.pdf