TriEmbed: Bridge the Gap between Text and Token Indices with Embedding Reparameterization

Baizhou Huang, Xiaojun Wan


Abstract
The current paradigm of language modeling is a two-stage pipeline that first transforms raw text into token indices, over which the distribution is then estimated. Tokenization inherently discards the linguistic relations between tokens, creating a fundamental gap between text and token indices. To address this, we propose TriEmbed, an embedding reparameterization method that incorporates the morphological relationships inherent in subword tokenization algorithms. Specifically, by organizing the vocabulary into a Trie structure, we can encode these relations and reparameterize the embeddings, facilitating the recovery of other linguistic relationships during training. Empirical results across various settings demonstrate that TriEmbed outperforms conventional embeddings in terms of scaling behavior, while yielding more linguistically informative token embeddings.
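The abstract describes organizing the subword vocabulary into a trie and reparameterizing token embeddings along it. As a rough, hypothetical sketch only (the paper's exact formulation is not given on this page), one natural way to realize such a scheme is to give every token a free parameter vector and define its embedding as the sum of the free vectors along its trie path, where each token's parent is taken to be its longest in-vocabulary proper prefix. The function names and the additive composition below are illustrative assumptions, not the authors' method:

```python
import random

def longest_prefix_parent(vocab):
    """Link each token to its longest proper prefix that is also a vocab
    entry, imposing a trie-like hierarchy on the subword vocabulary."""
    vocab_set = set(vocab)
    return {
        tok: next(
            (tok[:end] for end in range(len(tok) - 1, 0, -1)
             if tok[:end] in vocab_set),
            None,  # root tokens have no in-vocabulary prefix
        )
        for tok in vocab
    }

def reparameterized_embeddings(vocab, dim, seed=0):
    """Hypothetical trie-based reparameterization: each token's embedding is
    its own free vector plus its parent's embedding (recursively), so tokens
    sharing a morphological prefix share parameters."""
    rng = random.Random(seed)
    free = {tok: [rng.gauss(0.0, 1.0) for _ in range(dim)] for tok in vocab}
    parents = longest_prefix_parent(vocab)
    cache = {}

    def emb(tok):
        if tok not in cache:
            parent = parents[tok]
            base = emb(parent) if parent is not None else [0.0] * dim
            cache[tok] = [f + b for f, b in zip(free[tok], base)]
        return cache[tok]

    return free, {tok: emb(tok) for tok in vocab}

vocab = ["run", "runn", "running", "walk", "walking"]
free, E = reparameterized_embeddings(vocab, dim=4)
# "running" chains through "runn" and "run"; "walking" chains through "walk".
```

Under this sketch, a gradient update to the shared prefix vector (e.g. for "run") moves all of its morphological descendants together, which is one plausible reading of how a trie structure could help recover linguistic relations during training.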
Anthology ID:
2025.findings-acl.275
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
5291–5297
URL:
https://preview.aclanthology.org/display_plenaries/2025.findings-acl.275/
Cite (ACL):
Baizhou Huang and Xiaojun Wan. 2025. TriEmbed: Bridge the Gap between Text and Token Indices with Embedding Reparameterization. In Findings of the Association for Computational Linguistics: ACL 2025, pages 5291–5297, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
TriEmbed: Bridge the Gap between Text and Token Indices with Embedding Reparameterization (Huang & Wan, Findings 2025)
PDF:
https://preview.aclanthology.org/display_plenaries/2025.findings-acl.275.pdf