DVAGen: Dynamic Vocabulary Augmented Generation

Wei Du, Nuowei Liu, Jie Wang, Jiahao Kuang, Tao Ji, Xiaoling Wang, Yuanbin Wu


Abstract
Language models trained with a fixed vocabulary struggle to generalize to novel or out-of-vocabulary words, limiting their flexibility in handling diverse token combinations. Existing dynamic vocabulary approaches attempt to address this limitation but face challenges such as fragmented codebases, lack of support for modern LLMs, and limited inference scalability. To overcome these issues, we introduce DVAGen, a fully open-source, unified framework designed for training, evaluation, and visualization of dynamic vocabulary-augmented language models. Our framework modularizes the pipeline for ease of customization, integrates seamlessly with open-source LLMs, and is the first to provide both CLI and WebUI tools for real-time result inspection. We validate the effectiveness of dynamic vocabulary methods on modern LLMs and demonstrate support for batch inference, significantly improving inference throughput.
Anthology ID:
2025.emnlp-demos.26
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Ivan Habernal, Peter Schulam, Jörg Tiedemann
Venue:
EMNLP
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
365–372
Language:
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-demos.26/
DOI:
Bibkey:
Cite (ACL):
Wei Du, Nuowei Liu, Jie Wang, Jiahao Kuang, Tao Ji, Xiaoling Wang, and Yuanbin Wu. 2025. DVAGen: Dynamic Vocabulary Augmented Generation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 365–372, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
DVAGen: Dynamic Vocabulary Augmented Generation (Du et al., EMNLP 2025)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-demos.26.pdf