Scaling Laws or Threshold Effects: Exploring the Optimal Vocabulary Size for Balancing Performance and Efficiency in Low-Resource Languages

Ao Han, Andong Chen, Yuan Sun, Xiaobing Zhao


Abstract
While vocabulary expansion scaling laws are well established for high-resource languages, they remain unverified in low-resource settings. This gap is particularly critical for Byte-level BPE (BBPE), where constrained vocabulary sizes often fail to capture the rich morphemes of complex scripts, leading to severe over-segmentation in languages such as Mongolian, Tibetan, and Uyghur. We systematically investigate jointly scaled trilingual vocabularies for these languages (140 to 195,000 tokens) across BPE (Llama 2) and BBPE (Qwen2.5/3) architectures. Our results reveal that BBPE follows a "decline-then-rise" pattern, requiring a 9,000-token threshold (3,000 per language) to trigger non-linear performance gains and inference acceleration, whereas BPE improves monotonically. Using Pareto frontier analysis, we identify an optimal 79,500-token configuration for BBPE that reduces continuous pre-training duration by over 71% across 1.5B to 8B parameter models while consistently enhancing downstream performance.
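As a rough illustration of what a Pareto frontier selection over vocabulary sizes could look like, the sketch below keeps only the configurations that no other configuration beats on both performance and cost. It is not the authors' implementation: the `Config` type, the `dominates` and `pareto_frontier` helpers, and every number in `candidates` are hypothetical placeholders, not results from the paper.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    vocab_size: int   # number of added tokens (placeholder values)
    score: float      # hypothetical downstream metric, higher is better
    cost: float       # hypothetical training cost, lower is better


def dominates(a: Config, b: Config) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    no_worse = a.score >= b.score and a.cost <= b.cost
    strictly_better = a.score > b.score or a.cost < b.cost
    return no_worse and strictly_better


def pareto_frontier(configs: List[Config]) -> List[Config]:
    """Return the configurations that are not dominated by any other one."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Made-up candidates, chosen only to exercise the function.
candidates = [
    Config(1000, 0.40, 100.0),
    Config(2000, 0.38, 90.0),
    Config(3000, 0.50, 60.0),
    Config(4000, 0.55, 65.0),
    Config(5000, 0.54, 70.0),
]

frontier = pareto_frontier(candidates)
print([c.vocab_size for c in frontier])
```

With these placeholder values, two configurations remain on the frontier. Choosing a single "optimal" point among them still requires a preference between performance and cost, which this sketch leaves open.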
Anthology ID:
2026.findings-acl.1588
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
31741–31758
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1588/
Cite (ACL):
Ao Han, Andong Chen, Yuan Sun, and Xiaobing Zhao. 2026. Scaling Laws or Threshold Effects: Exploring the Optimal Vocabulary Size for Balancing Performance and Efficiency in Low-Resource Languages. In Findings of the Association for Computational Linguistics: ACL 2026, pages 31741–31758, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Scaling Laws or Threshold Effects: Exploring the Optimal Vocabulary Size for Balancing Performance and Efficiency in Low-Resource Languages (Han et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1588.pdf
Checklist:
 2026.findings-acl.1588.checklist.pdf