Spelling-out is not Straightforward: LLMs’ Capability of Tokenization from Token to Characters

Tatsuya Hiraoka, Kentaro Inui


Abstract
Large language models (LLMs) can spell out tokens character by character with high accuracy, yet they struggle with more complex character-level tasks, such as identifying compositional subcomponents within tokens. In this work, we investigate how LLMs internally represent and utilize character-level information during the spelling-out process. Our analysis reveals that, although spelling out is a simple task for humans, it is not handled in a straightforward manner by LLMs. Specifically, we show that the embedding layer does not fully encode character-level information, particularly beyond the first character. As a result, LLMs rely on intermediate and higher Transformer layers to reconstruct character-level knowledge, where we observe a distinct “breakthrough” in their spelling behavior. We validate this mechanism through three complementary analyses: probing classifiers, identification of knowledge neurons, and inspection of attention weights.
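To make the probing analysis concrete, below is a minimal illustrative sketch of a layer-wise probing classifier in the spirit of the paper's first analysis: train one linear probe per Transformer layer to predict a character of each vocabulary token from that layer's hidden state, and watch where accuracy jumps. The model name (gpt2), the probe target (the second character, since the paper reports that embeddings encode little beyond the first character), and the dataset construction are all assumptions for illustration, not the authors' exact setup.

```python
# Hedged sketch of a per-layer character probe; not the authors' code.
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "gpt2"  # assumption: stands in for the LLMs probed in the paper

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True).eval()

# Probe dataset: alphabetic vocabulary tokens of length >= 2, labeled with
# their SECOND character. Capped at 5000 tokens for tractability.
token_ids, labels = [], []
for tok_id in range(tokenizer.vocab_size):
    surface = tokenizer.decode([tok_id]).strip()
    if len(surface) >= 2 and surface.isascii() and surface.isalpha():
        token_ids.append(tok_id)
        labels.append(surface[1].lower())
    if len(token_ids) >= 5000:
        break

# Run each token through the model in isolation and keep every layer's
# hidden state (index 0 is the embedding layer's output).
per_layer_feats = None
with torch.no_grad():
    for start in range(0, len(token_ids), 256):
        batch = torch.tensor(token_ids[start:start + 256]).unsqueeze(1)
        hidden = model(input_ids=batch).hidden_states
        if per_layer_feats is None:
            per_layer_feats = [[] for _ in hidden]
        for layer_idx, h in enumerate(hidden):
            per_layer_feats[layer_idx].append(h[:, 0, :])

# One linear probe per layer; the accuracy curve traces the depth at which
# character-level information becomes linearly recoverable.
for layer_idx, chunks in enumerate(per_layer_feats):
    feats = torch.cat(chunks).numpy()
    X_tr, X_te, y_tr, y_te = train_test_split(
        feats, labels, test_size=0.2, random_state=0
    )
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(f"layer {layer_idx:2d}: probe accuracy = {probe.score(X_te, y_te):.3f}")
```

Under the paper's findings, one would expect low accuracy at the embedding layer (layer 0) and a marked rise at intermediate layers, mirroring the "breakthrough" in spelling behavior the abstract describes.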
Anthology ID:
2025.findings-emnlp.719
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rosé, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
13340–13353
URL:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.719/
DOI:
10.18653/v1/2025.findings-emnlp.719
Cite (ACL):
Tatsuya Hiraoka and Kentaro Inui. 2025. Spelling-out is not Straightforward: LLMs’ Capability of Tokenization from Token to Characters. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 13340–13353, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Spelling-out is not Straightforward: LLMs’ Capability of Tokenization from Token to Characters (Hiraoka & Inui, Findings 2025)
PDF:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.719.pdf
Checklist:
2025.findings-emnlp.719.checklist.pdf