The Strawberry Problem: Emergence of Character-level Understanding in Tokenized Language Models

Adrian Cosma, Stefan Ruseti, Emilian Radoi, Mihai Dascalu


Abstract
Despite their remarkable progress across diverse domains, Large Language Models (LLMs) consistently fail at simple character-level tasks, such as counting letters in words, due to a fundamental limitation: tokenization. In this work, we frame this limitation as a problem of low mutual information and analyze it in terms of concept emergence. Using a suite of 19 synthetic tasks that isolate character-level reasoning in a controlled setting, we show that such capabilities emerge suddenly and only late in training. We find that percolation-based models of concept emergence explain these patterns, suggesting that learning character composition is not fundamentally different from learning commonsense knowledge. To address this bottleneck, we propose a lightweight architectural modification that significantly improves character-level reasoning while preserving the inductive advantages of subword models. Together, our results bridge low-level perceptual gaps in tokenized LMs and provide a principled framework for understanding and mitigating their structural blind spots. We make our code publicly available.
Anthology ID:
2025.emnlp-main.1434
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
28240–28251
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1434/
Cite (ACL):
Adrian Cosma, Stefan Ruseti, Emilian Radoi, and Mihai Dascalu. 2025. The Strawberry Problem: Emergence of Character-level Understanding in Tokenized Language Models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 28240–28251, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
The Strawberry Problem: Emergence of Character-level Understanding in Tokenized Language Models (Cosma et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1434.pdf
Checklist:
2025.emnlp-main.1434.checklist.pdf