Corpus-Dependent Subcharacter Encoding via HMM-Guided Code Assignment

Tatsuya Hiraoka


Abstract
We propose a corpus-dependent alternative to byte encoding that learns fixed-length atomic codes for characters directly from text, which we refer to as Latom (Learned Atom-based Encoding). We instantiate this framework by training an HMM on N-repeated character sequences to estimate "atom" posteriors, followed by a Hungarian assignment that yields a globally optimal one-to-one character-code mapping. Across 14 languages, the encodings improve intrinsic metrics, including token counts after subword tokenization and bigram perplexity, at appropriate code lengths. On Amazon Reviews in six languages, Latom improves text classification accuracy and reduces decoding errors in language model generation. Overall, these results demonstrate that character encodings can be learned from corpus statistics while remaining reversible and compatible with standard tokenization pipelines.
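The assignment and reversibility properties described in the abstract can be illustrated with a small sketch. It is not the paper's implementation: the atom alphabet, the code length, and the score table are made-up stand-ins for the HMM posteriors, and a brute-force search over permutations replaces the Hungarian algorithm (both return the optimal one-to-one mapping, but brute force is only feasible for tiny inputs). All function names are hypothetical.

```python
from itertools import permutations, product

ATOMS = "ab"    # hypothetical atom alphabet
CODE_LEN = 2    # assumed number of atoms per character


def all_codes(atoms=ATOMS, n=CODE_LEN):
    """Enumerate every fixed-length code of n atoms."""
    return ["".join(p) for p in product(atoms, repeat=n)]


def best_assignment(chars, codes, score):
    """Return the one-to-one char -> code map with the highest total score.

    Brute-force stand-in for the Hungarian algorithm: it tries every
    injective assignment, so it only works for toy-sized inputs.
    """
    best, best_total = None, float("-inf")
    for perm in permutations(codes, len(chars)):
        total = sum(score[ch][code] for ch, code in zip(chars, perm))
        if total > best_total:
            best, best_total = dict(zip(chars, perm)), total
    return best


def encode(text, table):
    """Replace each character with its fixed-length atom code."""
    return "".join(table[ch] for ch in text)


def decode(coded, table, n=CODE_LEN):
    """Invert encode(): split into n-atom chunks and look each one up."""
    inverse = {code: ch for ch, code in table.items()}
    return "".join(inverse[coded[i:i + n]] for i in range(0, len(coded), n))


# Made-up scores standing in for per-character code posteriors.
score = {
    "x": {"aa": 0.9, "ab": 0.1, "ba": 0.0, "bb": 0.0},
    "y": {"aa": 0.8, "ab": 0.7, "ba": 0.1, "bb": 0.0},
    "z": {"aa": 0.1, "ab": 0.6, "ba": 0.5, "bb": 0.2},
}

table = best_assignment("xyz", all_codes(), score)
coded = encode("xyz", table)
print(table, coded, decode(coded, table))
```

In this toy table both "x" and "y" score highest on the code "aa"; the one-to-one constraint forces a joint decision rather than a per-character argmax. Because every code has the same length and the mapping is injective, decoding is a fixed-width lookup.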
Anthology ID:
2026.acl-long.1596
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
34569–34593
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1596/
Cite (ACL):
Tatsuya Hiraoka. 2026. Corpus-Dependent Subcharacter Encoding via HMM-Guided Code Assignment. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 34569–34593, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Corpus-Dependent Subcharacter Encoding via HMM-Guided Code Assignment (Hiraoka, ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1596.pdf
Checklist:
2026.acl-long.1596.checklist.pdf