Parallel Corpus Filtering Based on Semantic Similarity and Surface Dissimilarity for Japanese Text Simplification with LLMs

Daisuke Maekawa, Tomoyuki Kajiwara, Takashi Ninomiya


Abstract
We focus on low-cost fine-tuning of large language models (LLMs) for Japanese text simplification. LLMs have achieved high performance with fine-tuning on small parallel corpora in tasks such as machine translation and dialogue response generation. In this study, we propose a parallel corpus filtering method for text simplification and investigate how far the number of sentence pairs needed to fine-tune LLMs can be reduced. Experimental results on Japanese corpora in three domains revealed that the ability to perform the text simplification task can be acquired even from a very small corpus of 16 to 64 sentence pairs. Although more parallel data are needed to acquire domain knowledge, our method outperformed full fine-tuning while reducing the training corpus by approximately 70%.
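The filtering idea named in the title — keep pairs whose meaning is preserved (high semantic similarity) but whose wording actually changes (high surface dissimilarity) — can be sketched as below. This is a minimal illustration, not the paper's implementation: `semantic_sim` is a hypothetical scorer (in practice, e.g., cosine similarity of sentence embeddings), and the thresholds are placeholders.

```python
from difflib import SequenceMatcher


def surface_similarity(src: str, tgt: str) -> float:
    """Character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, src, tgt).ratio()


def filter_pairs(pairs, semantic_sim, sem_threshold=0.8, surf_threshold=0.9):
    """Keep (complex, simple) pairs that preserve meaning but rewrite the
    surface form: high semantic similarity AND low surface similarity."""
    kept = []
    for src, tgt in pairs:
        if (semantic_sim(src, tgt) >= sem_threshold
                and surface_similarity(src, tgt) < surf_threshold):
            kept.append((src, tgt))
    return kept


# Toy usage with a stub semantic scorer that treats all pairs as
# meaning-preserving; only the surface-dissimilarity filter acts here.
pairs = [("hello world", "hello world"),   # unchanged -> filtered out
         ("hello world", "hi world!")]     # rewritten -> kept
kept = filter_pairs(pairs, semantic_sim=lambda s, t: 1.0)
```

A real pipeline would replace the stub with an embedding-based scorer and tune both thresholds on held-out data.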
Anthology ID:
2026.lrec-main.86
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
Publisher:
ELRA Language Resource Association
Pages:
1110–1116
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.86/
Cite (ACL):
Daisuke Maekawa, Tomoyuki Kajiwara, and Takashi Ninomiya. 2026. Parallel Corpus Filtering Based on Semantic Similarity and Surface Dissimilarity for Japanese Text Simplification with LLMs. In Proceedings of the Fifteenth Language Resources and Evaluation Conference, pages 1110–1116, Palma de Mallorca, Spain. ELRA Language Resource Association.
Cite (Informal):
Parallel Corpus Filtering Based on Semantic Similarity and Surface Dissimilarity for Japanese Text Simplification with LLMs (Maekawa et al., LREC 2026)
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.86.pdf