Abstract
Byte-Pair Encoding (BPE) is an unsupervised sub-word tokenization technique commonly used in neural machine translation and other NLP tasks. Its effectiveness has made it a de facto standard, but the reasons for this are not well understood. We link BPE to the broader family of dictionary-based compression algorithms and compare it with other members of this family. Our experiments across datasets, language pairs, translation models, and vocabulary sizes show that, given a fixed vocabulary size budget, the fewer tokens an algorithm needs to cover the test set, the better the translation (as measured by BLEU).
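For readers unfamiliar with the procedure the abstract refers to, below is a minimal sketch of the standard BPE merge loop together with the coverage statistic the paper studies (how many tokens a learned vocabulary needs to segment a held-out text). The function names and toy corpus are illustrative, not from the paper's implementation, and end-of-word markers used by real BPE toolkits are omitted for brevity.

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merges: repeatedly merge the most frequent adjacent symbol pair."""
    # Represent each word as a tuple of symbols, starting from characters.
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge everywhere it occurs.
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

def segment(word, merges):
    """Greedily apply the learned merges, in learning order, to a new word."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Toy example: learn merges on a training corpus, then measure how many
# tokens are needed to cover a test set -- the quantity the paper links to BLEU.
train = ["lower", "lower", "lowest", "newer", "newer", "newest"]
test = ["lowest", "newest"]
merges = learn_bpe(train, num_merges=10)
tokens_needed = sum(len(segment(w, merges)) for w in test)
print(merges)
print(tokens_needed)  # fewer tokens to cover the test set -> better BLEU, per the paper
```

In practice, toolkits such as subword-nmt add end-of-word markers and handle tie-breaking and vocabulary thresholds; the sketch above only illustrates the mechanism by which a merge-based vocabulary yields shorter token sequences over a held-out test set.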
- Anthology ID: D19-1141
- Volume: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
- Month: November
- Year: 2019
- Address: Hong Kong, China
- Venues: EMNLP | IJCNLP
- SIG: SIGDAT
- Publisher: Association for Computational Linguistics
- Pages: 1375–1381
- URL: https://aclanthology.org/D19-1141
- DOI: 10.18653/v1/D19-1141
- Cite (ACL): Matthias Gallé. 2019. Investigating the Effectiveness of BPE: The Power of Shorter Sequences. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1375–1381, Hong Kong, China. Association for Computational Linguistics.
- Cite (Informal): Investigating the Effectiveness of BPE: The Power of Shorter Sequences (Gallé, EMNLP-IJCNLP 2019)
- PDF: https://preview.aclanthology.org/ingestion-script-update/D19-1141.pdf