Accelerating Multilingual Language Model for Excessively Tokenized Languages

Jimin Hong; Gibbeum Lee; Jaewoong Cho

Abstract

Recent advances in large language models (LLMs) have markedly improved performance on a variety of tasks in multiple languages. However, tokenizers in LLMs trained primarily on English-centric corpora often over-fragment text into character- or Unicode-level tokens in languages that do not use the Roman alphabet, leading to inefficient text generation. We introduce a simple yet effective framework to accelerate text generation in such languages. Our approach equips a pre-trained LLM with a new language model head whose vocabulary is tailored to a specific target language. The new head is then fine-tuned, with a verification step incorporated to ensure that the model’s performance is preserved. We show that this targeted fine-tuning, performed while all other model parameters are frozen, effectively reduces token fragmentation for the target language. Our extensive experiments demonstrate that the proposed framework increases generation speed by a factor of 1.7 while maintaining the performance of pre-trained multilingual models on target monolingual tasks.
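To make the fragmentation problem concrete, the following is a toy sketch and not the authors' implementation. It assumes the worst case in which an English-centric tokenizer falls back to one token per UTF-8 byte, and compares that with a greedy longest-match segmentation over a small target-language vocabulary. The helper names and the vocabulary `toy_vocab` are hypothetical and chosen only for illustration. Because an autoregressive model emits one token per decoding step, fewer tokens for the same text means fewer steps.

```python
def byte_fallback_tokens(text: str) -> list[bytes]:
    """Assumed worst case: every UTF-8 byte becomes its own token."""
    return [bytes([b]) for b in text.encode("utf-8")]


def greedy_vocab_tokens(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match segmentation with a target-language vocabulary.

    Spans not covered by the vocabulary fall back to single characters.
    """
    max_len = max(map(len, vocab), default=1)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


text = "안녕하세요 세계"  # Korean for "Hello world"
toy_vocab = {"안녕", "하세요", "세계"}  # hypothetical target-language entries

fragmented = byte_fallback_tokens(text)
compact = greedy_vocab_tokens(text, toy_vocab)

# Each token costs one decoding step, so the ratio is the step reduction
# for this toy sentence only.
print(len(fragmented), len(compact), len(fragmented) / len(compact))
```

The ratio printed here reflects only this hand-picked sentence and vocabulary. It is unrelated to the 1.7x speedup reported in the abstract, which was measured on real models and tasks.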

Anthology ID: 2024.findings-acl.660
Volume: Findings of the Association for Computational Linguistics ACL 2024
Month: August
Year: 2024
Address: Bangkok, Thailand and virtual meeting
Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 11095–11111
URL: https://aclanthology.org/2024.findings-acl.660
Cite (ACL): Jimin Hong, Gibbeum Lee, and Jaewoong Cho. 2024. Accelerating Multilingual Language Model for Excessively Tokenized Languages. In Findings of the Association for Computational Linguistics ACL 2024, pages 11095–11111, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.
Cite (Informal): Accelerating Multilingual Language Model for Excessively Tokenized Languages (Hong et al., Findings 2024)
PDF: https://preview.aclanthology.org/nschneid-patch-4/2024.findings-acl.660.pdf
