Phonemes to the Rescue: Multilingual Tokenization Based on International Phonetic Alphabet

Milan Miletić, Julie Kallini, Ekaterina Shutova

Abstract

Multilingual language models often exhibit performance disparities across languages that can arise as early as the tokenization stage. Widely used subword tokenization approaches favor high-resource languages, and tokenizer-free methods still yield longer sequences for scripts with a higher bytes-per-character ratio. To address these shortcomings, we propose using the International Phonetic Alphabet (IPA) as a language-agnostic input representation for multilingual tokenizers. IPA provides a compact symbol inventory, greater cross-lingual character overlap, and a more balanced bytes-per-character distribution across languages. We train matched pairs of text and IPA subword tokenizers across 24 languages and 14 scripts and demonstrate that IPA tokenizers consistently improve tokenization quality, especially for non-Latin scripts, and generalize more effectively to unseen languages and scripts.
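
The bytes-per-character disparity mentioned in the abstract can be illustrated with a minimal Python sketch. This is not code from the paper: the helper name `bytes_per_char`, the sample words, and the hand-written IPA transcription are assumptions chosen for illustration, and "character" here means a Unicode code point encoded as UTF-8.

```python
def bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per Unicode code point in `text`."""
    if not text:
        raise ValueError("text must be non-empty")
    return len(text.encode("utf-8")) / len(text)


# Illustrative samples only (not data from the paper).
samples = {
    "English, Latin script": "hello",
    "Russian, Cyrillic script": "привет",
    "Hindi, Devanagari script": "नमस्ते",
    # Hand-written IPA transcription of "hello", assumed for illustration.
    "English, IPA": "həˈloʊ",
}

for label, text in samples.items():
    print(f"{label}: {bytes_per_char(text):.2f} bytes per character")
```

In UTF-8, basic Latin letters take one byte, Cyrillic letters two, and Devanagari code points three, so a byte-level model sees sequences up to three times longer for the same number of characters. IPA symbols fall in the one- to two-byte range, which is consistent with the more balanced distribution the abstract describes.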

Anthology ID: 2026.acl-long.1872
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 40323–40349
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1872/
Cite (ACL): Milan Miletić, Julie Kallini, and Ekaterina Shutova. 2026. Phonemes to the Rescue: Multilingual Tokenization Based on International Phonetic Alphabet. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 40323–40349, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Phonemes to the Rescue: Multilingual Tokenization Based on International Phonetic Alphabet (Miletić et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1872.pdf
Checklist: 2026.acl-long.1872.checklist.pdf