Can Perplexity Predict Finetuning Performance? An Investigation of Tokenization Effects on Sequential Language Models for Nepali

Nishant Luitel, Nirajan Bekoju, Anand Kumar Sah, Subarna Shakya


Abstract
The impact of subword tokenization on language model performance is well-documented for perplexity, with finer granularity consistently reducing this intrinsic metric. However, research on how different tokenization schemes affect a model’s understanding capabilities remains limited, particularly for non-Latin script languages. Addressing this gap, we conducted a comprehensive evaluation of six distinct tokenization strategies by pretraining transformer-based language models for Nepali and evaluating their performance across multiple downstream tasks. While recent prominent models like GPT, RoBERTa, Claude, LLaMA, Mistral, Falcon, and MPT have adopted byte-level BPE tokenization, our findings demonstrate that for Nepali, SentencePiece tokenization consistently yields superior results on understanding-based tasks. Unlike previous studies that primarily focused on BERT-based architectures, our research specifically examines sequential transformer models, providing valuable insights for language model development in low-resource languages and highlighting the importance of tokenization strategy beyond perplexity reduction.
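For readers unfamiliar with the contrast the abstract draws between SentencePiece and byte-level BPE, the sketch below shows how the two schemes segment a short Nepali sentence. It is a minimal illustration assuming the sentencepiece and HuggingFace tokenizers packages; the corpus path, vocabulary sizes, and sample sentence are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch: contrast SentencePiece and byte-level BPE segmentation
# of a Nepali sentence. Corpus path, vocab sizes, and the sample sentence
# are hypothetical, not taken from the paper.
import sentencepiece as spm
from tokenizers import ByteLevelBPETokenizer

CORPUS = "nepali_corpus.txt"      # hypothetical plain-text training corpus
SAMPLE = "नेपाली भाषा सुन्दर छ।"   # "The Nepali language is beautiful."

# SentencePiece trains directly on raw Unicode text, so Devanagari
# characters stay intact inside the learned subword pieces.
spm.SentencePieceTrainer.train(
    input=CORPUS, model_prefix="nepali_sp", vocab_size=8000,
    character_coverage=0.9995,    # retain rare Devanagari characters
)
sp = spm.SentencePieceProcessor(model_file="nepali_sp.model")
print("SentencePiece:", sp.encode(SAMPLE, out_type=str))

# Byte-level BPE operates on UTF-8 bytes, so a single multi-byte
# Devanagari character can be split across several byte-level merges.
bbpe = ByteLevelBPETokenizer()
bbpe.train(files=[CORPUS], vocab_size=8000)
print("Byte-level BPE:", bbpe.encode(SAMPLE).tokens)
```

Because the two tokenizers produce different token counts for the same text, their per-token perplexities are not directly comparable, which is one reason the paper evaluates downstream finetuning performance rather than relying on perplexity alone.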
Anthology ID:
2025.gem-1.21
Volume:
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)
Month:
July
Year:
2025
Address:
Vienna, Austria and virtual meeting
Editors:
Kaustubh Dhole, Miruna Clinciu
Venues:
GEM | WS
Publisher:
Association for Computational Linguistics
Pages:
239–248
URL:
https://preview.aclanthology.org/transition-to-people-yaml/2025.gem-1.21/
Cite (ACL):
Nishant Luitel, Nirajan Bekoju, Anand Kumar Sah, and Subarna Shakya. 2025. Can Perplexity Predict Finetuning Performance? An Investigation of Tokenization Effects on Sequential Language Models for Nepali. In Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²), pages 239–248, Vienna, Austria and virtual meeting. Association for Computational Linguistics.
Cite (Informal):
Can Perplexity Predict Finetuning Performance? An Investigation of Tokenization Effects on Sequential Language Models for Nepali (Luitel et al., GEM 2025)
PDF:
https://preview.aclanthology.org/transition-to-people-yaml/2025.gem-1.21.pdf