From Bytes to Subwords: Challenges of Input Representations in NLP

Rob Van Der Goot

Abstract

A first decision for any automated natural language processing system is the granularity of the input units. Traditionally, characters or words have been used, but recently, subwords have become the standard. In this paper, we investigate trends in input processing steps and discuss common shortcomings in this foundational first step of model design. We start by providing an overview of currently used tokenizers, showing that there is only minimal variety, with three highly similar designs dominating current models, and many of the tokenizers being exact duplicates. Next, we reconsider Unicode normalization strategies. Previous work has recommended applying consistent normalization; however, we argue that this removes signal, and we show how it can harm performance for language classification. Finally, we take a closer look at UTF-8 character encoding, the very first layer of representation used in many language models. We argue that UTF-8 is optimized neither for efficiency nor for fairness across languages, and we propose proof-of-concept alternatives focused on fairness and efficiency. Based on our findings, we recommend that future work 1) put more thought into subword segmentation and explore more diversity, 2) apply normalization only when beneficial, and 3) consider alternative character encodings for models operating on the byte level.
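
The two encoding-level points in the abstract can be illustrated with a minimal sketch using only the Python standard library. It is not the paper's experimental setup: the sample words and the helper names `utf8_bytes_per_char` and `normalization_changes` are assumptions chosen for illustration. The first helper shows that UTF-8 spends a different number of bytes per character depending on the script; the second shows that NFKC normalization can erase distinctions, such as fullwidth Latin letters, that might otherwise serve as a cue to a classifier.

```python
import unicodedata


def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per Unicode code point in `text`."""
    return len(text.encode("utf-8")) / len(text)


def normalization_changes(text: str, form: str = "NFKC") -> bool:
    """True if applying the given Unicode normalization form alters `text`."""
    return unicodedata.normalize(form, text) != text


# Illustrative sample words (hypothetical choice, each meaning "language").
samples = {
    "English": "language",
    "Greek": "γλώσσα",
    "Hindi": "भाषा",
    "Chinese": "语言",
}

for name, word in samples.items():
    # Scripts outside ASCII need two or more bytes per code point in UTF-8.
    print(f"{name}: {utf8_bytes_per_char(word):.1f} bytes per character")

# NFKC maps compatibility characters to their plain equivalents, e.g. the
# "fi" ligature (U+FB01) and the fullwidth capital A (U+FF21).
for ch in ["\ufb01", "\uff21", "A"]:
    print(repr(ch), "changed by NFKC:", normalization_changes(ch))
```

A byte-level model therefore sees sequences of different lengths for comparable content in different scripts, and a pipeline that normalizes unconditionally discards character-level distinctions before the model can use them.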

Anthology ID: 2026.findings-acl.530
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 10911–10919
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.530/
Cite (ACL): Rob Van Der Goot. 2026. From Bytes to Subwords: Challenges of Input Representations in NLP. In Findings of the Association for Computational Linguistics: ACL 2026, pages 10911–10919, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): From Bytes to Subwords: Challenges of Input Representations in NLP (Van Der Goot, Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.530.pdf
Checklist: 2026.findings-acl.530.checklist.pdf