Learning Faster with Better Tokens: Parameter-Efficient Vocabulary Adaptation for Specialized Text Summarization

Gunjan Balde; Soumyadeep Roy; Mainack Mondal; Niloy Ganguly

Learning Faster with Better Tokens: Parameter-Efficient Vocabulary Adaptation for Specialized Text Summarization

Gunjan Balde, Soumyadeep Roy, Mainack Mondal, Niloy Ganguly

Abstract

Large language models pretrained on general-domain corpora often exhibit tokenization inefficiencies when applied to specialized domains. Although continual pretraining for domain adaptation partially alleviate performance degradation, it does not resolve the fundamental vocabulary mismatch. To address this gap, we introduce a targeted parameter-efficient domain adaptation approach that combines vocabulary adaptation with pretraining for LLM-based text summarization. Our unified framework augments pretrained tokenizers with domain-specific tokens while selectively replacing under-trained and unreachable tokens to limit parameter growth. We evaluate our approach on Llama-3.1-8B and Qwen2.5-7B across legal and medical summarization tasks on a challenge-oriented evaluation protocol focused on expert-driven text and summaries which typically has higher concentration of over-fragmented Out-of-Vocabulary (**OOV**) words. The vocabulary adaptation algorithm enhances the overall quality of the summarization model by improving semantic similarity between the generated summaries and their references. In addition, the adapted model produces summaries that incorporate more appropriate novel and domain-specific words, leading to improved coherence, relevance, and faithfulness. We further observe that our proposed approach significantly reduce training time by 35-55% over continual pretraining and reduce parameter counts up to 37% w.r.t expansion-only methods. We make codebase publicly available at [url](https://github.com/gb-kgp/VocabReplace-Then-Expand).

Anthology ID:: 2026.acl-long.2088
Volume:: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:: July
Year:: 2026
Address:: San Diego, California, United States
Editors:: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:: ACL
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 45071–45086
Language:
URL:: https://preview.aclanthology.org/ingest-acl/2026.acl-long.2088/
DOI:
Bibkey:
Cite (ACL):: Gunjan Balde, Soumyadeep Roy, Mainack Mondal, and Niloy Ganguly. 2026. Learning Faster with Better Tokens: Parameter-Efficient Vocabulary Adaptation for Specialized Text Summarization. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 45071–45086, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):: Learning Faster with Better Tokens: Parameter-Efficient Vocabulary Adaptation for Specialized Text Summarization (Balde et al., ACL 2026)
Copy Citation:
PDF:: https://preview.aclanthology.org/ingest-acl/2026.acl-long.2088.pdf
Checklist:: 2026.acl-long.2088.checklist.pdf

PDF Cite Search Checklist Fix data