An Analysis of BPE Vocabulary Trimming in Neural Machine Translation

Marco Cognetta, Tatsuya Hiraoka, Rico Sennrich, Yuval Pinter, Naoaki Okazaki


Abstract
We explore threshold vocabulary trimming in Byte-Pair Encoding subword tokenization, a tokenization postprocessing step that replaces rare subwords with their component subwords. The technique is available in popular tokenization libraries but has not been subjected to rigorous scientific scrutiny. While the removal of rare subwords is suggested as best practice in model implementations, both to reduce model size and to improve model performance through robustness, our experiments indicate that, across a large space of hyperparameter settings, vocabulary trimming fails to consistently improve model performance, and is even prone to incurring heavy degradation.
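As a rough illustration of the trimming step described in the abstract, the sketch below removes merged subwords whose frequency falls under a threshold and rewrites each removed subword as the pair it was merged from, recursing until every piece is still in the vocabulary. It is a minimal sketch, not the authors' code or any library's API: the function names (`trim_vocabulary`, `decompose`), the toy merge list, and the counts are all hypothetical, and it assumes single characters are never trimmed.

```python
from collections import Counter

def trim_vocabulary(merges, token_counts, threshold):
    """Return the merged subwords whose frequency is below the threshold.

    Only subwords produced by a merge are candidates, so single
    characters always stay in the vocabulary (an assumption of this sketch).
    """
    return {a + b for a, b in merges if token_counts.get(a + b, 0) < threshold}

def decompose(token, parents, trimmed):
    """Recursively replace a trimmed subword with the pair it was merged from."""
    if token not in trimmed:
        return [token]
    left, right = parents[token]
    return decompose(left, parents, trimmed) + decompose(right, parents, trimmed)

# Hypothetical toy BPE merge list, in the order the merges were learned.
merges = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]
parents = {a + b: (a, b) for a, b in merges}

# Hypothetical subword frequencies in the tokenized training corpus.
token_counts = Counter({"lo": 3, "low": 40, "er": 120, "lower": 2})

trimmed = trim_vocabulary(merges, token_counts, threshold=5)

# Re-tokenize a sequence that uses the untrimmed vocabulary.
tokens = ["lower", "low", "er"]
result = [piece for t in tokens for piece in decompose(t, parents, trimmed)]
print(result)
```

In this toy setting the rare subword `lower` is replaced by `low` and `er`, which shortens the vocabulary at the cost of a longer token sequence.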
Anthology ID:
2024.insights-1.7
Volume:
Proceedings of the Fifth Workshop on Insights from Negative Results in NLP
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Shabnam Tafreshi, Arjun Akula, João Sedoc, Aleksandr Drozd, Anna Rogers, Anna Rumshisky
Venues:
insights | WS
Publisher:
Association for Computational Linguistics
Pages:
48–50
URL:
https://aclanthology.org/2024.insights-1.7
Cite (ACL):
Marco Cognetta, Tatsuya Hiraoka, Rico Sennrich, Yuval Pinter, and Naoaki Okazaki. 2024. An Analysis of BPE Vocabulary Trimming in Neural Machine Translation. In Proceedings of the Fifth Workshop on Insights from Negative Results in NLP, pages 48–50, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
An Analysis of BPE Vocabulary Trimming in Neural Machine Translation (Cognetta et al., insights-WS 2024)
PDF:
https://preview.aclanthology.org/ingestion-checklist/2024.insights-1.7.pdf