The Token Tax: Systematic Bias in Multilingual Tokenization
Jessica M. Lundin, Ada Zhang, Nihal Karim, Hamza Louzan, Guohao Wei, David Ifeoluwa Adelani, Cody Carroll
Abstract
Tokenization inefficiency is associated with structural disadvantages on morphologically complex, low-resource languages, inflating compute resources and reducing accuracy. We evaluate 10 Large Language Models (LLMs) on AfriMMLU (5 subjects; 16 African languages) and show that token fertility reliably predicts accuracy. Higher fertility consistently predicts lower accuracy across all models and subjects. We further find that reasoning models (e.g., DeepSeek, o1) consistently outperform non-reasoning peers across high- and low-resource languages in the AfriMMLU dataset, narrowing accuracy gaps observed in prior generations. In terms of economics, a doubling in tokens results in quadrupled training cost and time, underscoring the “token tax” faced by many languages. These results motivate morphologically aware tokenization, fair pricing, and multilingual benchmarks for equitable natural language processing (NLP).- Anthology ID:
- 2026.africanlp-main.10
- Volume:
- Proceedings of the 7th Workshop on African Natural Language Processing (AfricaNLP 2026)
- Month:
- March
- Year:
- 2026
- Address:
- Rabat, Morocco
- Editors:
- Everlyn Asiko Chimoto, Constantine Lignos, Shamsuddeen Muhammad, Idris Abdulmumin, Clemencia Siro, David Ifeoluwa Adelani
- Venues:
- AfricaNLP | WS
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 103–112
- Language:
- URL:
- https://preview.aclanthology.org/manual-author-scripts/2026.africanlp-main.10/
- DOI:
- Cite (ACL):
- Jessica M. Lundin, Ada Zhang, Nihal Karim, Hamza Louzan, Guohao Wei, David Ifeoluwa Adelani, and Cody Carroll. 2026. The Token Tax: Systematic Bias in Multilingual Tokenization. In Proceedings of the 7th Workshop on African Natural Language Processing (AfricaNLP 2026), pages 103–112, Rabat, Morocco. Association for Computational Linguistics.
- Cite (Informal):
- The Token Tax: Systematic Bias in Multilingual Tokenization (Lundin et al., AfricaNLP 2026)
- PDF:
- https://preview.aclanthology.org/manual-author-scripts/2026.africanlp-main.10.pdf