@inproceedings{takahashi-ishihara-2025-quantifying,
title = "Quantifying Memorization in Continual Pre-training with {J}apanese General or Industry-Specific Corpora",
author = "Takahashi, Hiromu and
Ishihara, Shotaro",
editor = "Jia, Robin and
Wallace, Eric and
Huang, Yangsibo and
Pimentel, Tiago and
Maini, Pratyush and
Dankers, Verna and
Wei, Johnny and
Lesci, Pietro",
booktitle = "Proceedings of the First Workshop on Large Language Model Memorization (L2M2)",
month = aug,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/landing_page/2025.l2m2-1.8/",
pages = "95--105",
ISBN = "979-8-89176-278-7",
abstract = "Despite the growing concern about memorization of training data using large language models (LLMs), there has been insufficient analysis under conditions using non-English or industry-specific corpora.This study focuses on continual pre-training, a common approach in building non-English LLMs, and quantifies memorization of training data.Specifically, we trained two models based on Llama 3 using Japanese Wikipedia (general) and Japanese financial news articles (industry-specific).Experiments showed a tendency for the amount of memorization to increase as training progressed, similar to the empirical findings for English.This trend was clear in the industry-specific corpus, suggesting potential risks when using valuable, non-general industry corpora.We also identified issues specific to Japanese, and emphasized the importance of analysis other than in English."
}