Scaling Laws or Threshold Effects: Exploring the Optimal Vocabulary Size for Balancing Performance and Efficiency in Low-Resource Languages

Ao Han; Andong Chen (陈安东); Yuan Sun; Xiaobing Zhao

Abstract

While vocabulary expansion scaling laws are well established for high-resource languages, they remain unverified in low-resource settings. This gap is particularly critical for Byte-level BPE (BBPE), where constrained vocabulary sizes often fail to capture the rich morphemes of complex scripts, leading to severe over-segmentation in languages such as Mongolian, Tibetan, and Uyghur. We systematically investigate a jointly scaled trilingual vocabulary for these languages (140 to 195,000 tokens) across BPE (Llama 2) and BBPE (Qwen2.5/3) tokenizers. Our results reveal that BBPE follows a "decline-then-rise" pattern, requiring a 9,000-token threshold (3,000 per language) to trigger non-linear performance gains and inference acceleration, whereas BPE improves monotonically. Using Pareto frontier analysis, we identify an optimal 79,500-token configuration for BBPE that reduces continual pre-training duration by over 71% across 1.5B to 8B parameter models while consistently enhancing downstream performance.
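
As a rough illustration of the Pareto frontier analysis mentioned in the abstract, the sketch below keeps only those configurations that no other configuration beats on both downstream score and training cost. This is a minimal sketch under stated assumptions, not the authors' procedure: the function name `pareto_frontier`, the two objectives, and every number in `candidates` are hypothetical placeholders rather than results from the paper.

```python
from typing import List, Tuple

# (label, downstream score, training cost); higher score and lower cost are better.
Config = Tuple[str, float, float]


def pareto_frontier(points: List[Config]) -> List[Config]:
    """Return the configurations that are not dominated by any other one.

    A configuration is dominated when another one is at least as good on
    both objectives and strictly better on at least one of them.
    """
    frontier = []
    for name, score, cost in points:
        dominated = any(
            (s >= score and c <= cost) and (s > score or c < cost)
            for _, s, c in points
        )
        if not dominated:
            frontier.append((name, score, cost))
    return frontier


# Hypothetical placeholder values, NOT measurements from the paper.
candidates: List[Config] = [
    ("vocab_A", 0.60, 10.0),
    ("vocab_B", 0.70, 6.0),
    ("vocab_C", 0.72, 9.0),
    ("vocab_D", 0.65, 8.0),
]

for name, score, cost in pareto_frontier(candidates):
    print(f"{name}: score={score:.2f}, cost={cost:.1f}")
```

With these placeholder values, `vocab_A` and `vocab_D` drop out because `vocab_B` is both cheaper and higher scoring, while `vocab_B` and `vocab_C` remain as alternative trade-offs between cost and score.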

Anthology ID: 2026.findings-acl.1588
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 31741–31758
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1588/
Cite (ACL): Ao Han, Andong Chen, Yuan Sun, and Xiaobing Zhao. 2026. Scaling Laws or Threshold Effects: Exploring the Optimal Vocabulary Size for Balancing Performance and Efficiency in Low-Resource Languages. In Findings of the Association for Computational Linguistics: ACL 2026, pages 31741–31758, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Scaling Laws or Threshold Effects: Exploring the Optimal Vocabulary Size for Balancing Performance and Efficiency in Low-Resource Languages (Han et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1588.pdf
Checklist: 2026.findings-acl.1588.checklist.pdf
