How Tokenization Limits Phonological Knowledge Representation in Language Models and How to Improve Them

Disen Liao, Freda Shi


Abstract
Tokenization is the first step in every language model (LM), yet it never takes the sounds of words into account. We investigate how tokenization influences text-only LMs’ ability to represent phonological knowledge. Through a series of probing experiments, we show that subword-based tokenization systematically weakens the encoding of both local (e.g., rhyme) and global (e.g., syllabification) phonological features. To quantify this effect, we introduce the syllabification-tokenization alignment distance (STAD), a metric that measures the misalignment between a model’s tokenization and the natural syllable boundaries of words, and find that higher misalignment correlates with poorer phonological representations, providing a simple diagnostic for phonology-aware tokenization. To address these limitations, we propose a lightweight IPA-based fine-tuning method that infuses phonological awareness into LMs, leading to consistent improvements across three phonology-related tasks while largely preserving math and general reasoning ability, with 1.1% and 0.9% drops on GSM8K and MMLU, respectively. Our code is available at https://github.com/liaodisen/Tokenization-Phonology.
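The abstract does not give the formula for STAD, so the following is only a minimal sketch of the general idea of comparing a tokenizer's split points with syllable boundaries. It is not the paper's metric. The function names `boundaries` and `boundary_mismatch`, the choice of counting boundaries that appear in one segmentation but not the other, and the example token split are all assumptions made for illustration.

```python
def boundaries(pieces):
    """Return the set of internal split positions (character offsets) of a segmentation."""
    positions = set()
    offset = 0
    for piece in pieces[:-1]:  # the end of the last piece is the word end, not a split
        offset += len(piece)
        positions.add(offset)
    return positions


def boundary_mismatch(token_pieces, syllable_pieces):
    """Count split positions present in exactly one of the two segmentations.

    Hypothetical stand-in for a misalignment score: 0 means the token
    boundaries and syllable boundaries coincide exactly.
    """
    if "".join(token_pieces) != "".join(syllable_pieces):
        raise ValueError("both segmentations must spell the same word")
    # Symmetric difference: boundaries the two segmentations disagree on.
    return len(boundaries(token_pieces) ^ boundaries(syllable_pieces))


# Illustrative inputs: the token split is assumed, not taken from a real tokenizer.
tokens = ["token", "ization"]
syllables = ["to", "ken", "i", "za", "tion"]
score = boundary_mismatch(tokens, syllables)
```

Here the two segmentations share only the split after "token", so the score counts the three syllable boundaries that the token split misses. A real analysis would take token pieces from the model's tokenizer and syllables from a pronunciation resource.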
Anthology ID:
2026.acl-long.634
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
13921–13938
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.634/
Cite (ACL):
Disen Liao and Freda Shi. 2026. How Tokenization Limits Phonological Knowledge Representation in Language Models and How to Improve Them. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13921–13938, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
How Tokenization Limits Phonological Knowledge Representation in Language Models and How to Improve Them (Liao & Shi, ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.634.pdf
Checklist:
2026.acl-long.634.checklist.pdf