@inproceedings{velayuthan-sarveswaran-2025-egalitarian,
title = "Egalitarian Language Representation in Language Models: It All Begins with Tokenizers",
author = "Velayuthan, Menan and
Sarveswaran, Kengatharaiyer",
editor = "Rambow, Owen and
Wanner, Leo and
Apidianaki, Marianna and
Al-Khalifa, Hend and
Di Eugenio, Barbara and
Schockaert, Steven",
booktitle = "Proceedings of the 31st International Conference on Computational Linguistics",
month = jan,
year = "2025",
address = "Abu Dhabi, UAE",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.coling-main.400/",
pages = "5987--5996",
abstract = "Tokenizers act as a bridge between human language and the latent space of language models, influencing how language is represented in these models. Despite the dominance of English-Centric (EC) Large Language Models (LLMs), tokenization methods often fail to fairly represent complex scripts like Tamil, Sinhala, and Hindi, primarily due to pre-tokenization choices. This study demonstrates that pre-tokenization has a more significant impact than tokenization algorithms on achieving egalitarian representation. To address this, we introduce an improvement to the Byte Pair Encoding (BPE) algorithm by incorporating graphemes, which we term Grapheme Pair Encoding (GPE). Our experiments show that grapheme-based character extraction outperforms byte-level tokenizers for complex scripts. We validate this approach through experiments on Tamil, Sinhala, and Hindi. The codebase and resources used in this work are publicly available at https://github.com/vmenan/tokenizers-coling2025."
}
Markdown (Informal)
[Egalitarian Language Representation in Language Models: It All Begins with Tokenizers](https://aclanthology.org/2025.coling-main.400/) (Velayuthan & Sarveswaran, COLING 2025)
ACL