EmByte: Decomposition and Compression Learning for Small yet Private NLP

Shenglan Li, Jia Xu, Mengjiao Zhang


Abstract
Recent breakthroughs in natural language processing (NLP) have come with escalating model sizes and computational costs, posing significant challenges for deployment in real-time and resource-constrained environments. We introduce EMBYTE, a novel byte-level tokenization model that achieves substantial embedding compression while preserving NLP accuracy and enhancing privacy. At the core of EMBYTE is a new Decompose-and-Compress (DeComp) learning strategy that decomposes subwords into fine-grained byte embeddings and then compresses them via neural projection. DeComp enables EMBYTE to be shrunk down to any vocabulary size (e.g., 128 or 256), drastically reducing embedding parameter count by up to 94% compared to subword-based models without increasing sequence length or degrading performance. Moreover, EMBYTE is resilient to privacy threats such as gradient inversion attacks, due to its byte-level many-to-one mapping structure. Empirical results on GLUE, machine translation, sentiment analysis, and language modeling tasks show that EMBYTE matches or surpasses the performance of significantly larger models, while offering improved efficiency. This makes EMBYTE a lightweight and generalizable NLP solution, well-suited for deployment in privacy-sensitive or low-resource environments.
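The abstract's core mechanism, Decompose-and-Compress (DeComp), splits each subword into its bytes, embeds the bytes from a tiny fixed-size table, and compresses them back into one vector per subword via a learned projection. The following is a minimal sketch of that idea, assuming PyTorch; the class and parameter names (ByteDeComp, d_byte, max_bytes) are illustrative, not the authors' implementation.

```python
# Minimal sketch of the Decompose-and-Compress (DeComp) idea from the abstract.
# Assumes PyTorch; names are illustrative, not the paper's actual code.
import torch
import torch.nn as nn


class ByteDeComp(nn.Module):
    """Decompose each subword into UTF-8 bytes, embed them from a 256-entry
    table, then compress the byte embeddings back into one vector per subword,
    so the model's sequence length stays at the subword level."""

    def __init__(self, d_model: int = 768, d_byte: int = 64, max_bytes: int = 16):
        super().__init__()
        self.max_bytes = max_bytes
        # Byte vocabulary is fixed at 256 (+1 padding id), replacing a 30k+
        # subword embedding table -- the source of the quoted parameter savings.
        self.byte_embed = nn.Embedding(257, d_byte, padding_idx=256)
        # Neural projection compressing a subword's byte embeddings into a
        # single d_model-dimensional embedding.
        self.compress = nn.Linear(max_bytes * d_byte, d_model)

    def decompose(self, subwords):
        """Map a list of subword strings to padded byte-id tensors."""
        ids = []
        for sw in subwords:
            b = list(sw.encode("utf-8"))[: self.max_bytes]
            b += [256] * (self.max_bytes - len(b))  # pad to max_bytes
            ids.append(b)
        return torch.tensor(ids)  # (num_subwords, max_bytes)

    def forward(self, subwords):
        byte_ids = self.decompose(subwords)    # (S, max_bytes)
        byte_vecs = self.byte_embed(byte_ids)  # (S, max_bytes, d_byte)
        flat = byte_vecs.flatten(start_dim=1)  # (S, max_bytes * d_byte)
        return self.compress(flat)             # (S, d_model)


if __name__ == "__main__":
    layer = ByteDeComp()
    out = layer(["Em", "Byte", " is", " small"])
    print(out.shape)  # torch.Size([4, 768]) -- one embedding per subword
```

Because many subwords share the same byte embeddings, the mapping from tokens to embedding parameters is many-to-one, which is the structural property the abstract credits for resilience to gradient inversion attacks.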
Anthology ID:
2025.findings-emnlp.379
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
7182–7201
URL:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.379/
DOI:
10.18653/v1/2025.findings-emnlp.379
Cite (ACL):
Shenglan Li, Jia Xu, and Mengjiao Zhang. 2025. EmByte: Decomposition and Compression Learning for Small yet Private NLP. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 7182–7201, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
EmByte: Decomposition and Compression Learning for Small yet Private NLP (Li et al., Findings 2025)
PDF:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.379.pdf
Checklist:
2025.findings-emnlp.379.checklist.pdf