Local Byte Fusion for Neural Machine Translation
Makesh Narsimhan Sreedhar, Xiangpeng Wan, Yu Cheng, Junjie Hu
Abstract
Subword tokenization schemes are the dominant technique used in current NLP models. However, such schemes can be rigid and tokenizers built on one corpus may not adapt well to other parallel corpora. It has also been observed that in multilingual corpora, subword tokenization schemes oversegment low-resource languages, leading to a drop in translation performance. An alternative to subword tokenizers is byte-based tokenization, i.e., tokenization into byte sequences using the UTF-8 encoding scheme. Byte tokens often represent inputs at a sub-character granularity, i.e., one character can be represented by a span of byte tokens. This results in much longer byte sequences that are hard to interpret without aggregating local information from multiple byte tokens. In this paper, we propose a Local Byte Fusion (LOBEF) method for byte-based machine translation—utilizing byte n-gram and word boundaries—to aggregate local semantic information. Extensive experiments on multilingual translation, zero-shot cross-lingual transfer, and domain adaptation reveal a consistent improvement over vanilla byte-based models. Further analysis also indicates that our byte-based models are parameter-efficient and perform competitive to subword models.- Anthology ID:
- 2023.acl-long.397
- Volume:
- Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month:
- July
- Year:
- 2023
- Address:
- Toronto, Canada
- Editors:
- Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
- Venue:
- ACL
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 7199–7214
- Language:
- URL:
- https://preview.aclanthology.org/build-pipeline-with-new-library/2023.acl-long.397/
- DOI:
- 10.18653/v1/2023.acl-long.397
- Cite (ACL):
- Makesh Narsimhan Sreedhar, Xiangpeng Wan, Yu Cheng, and Junjie Hu. 2023. Local Byte Fusion for Neural Machine Translation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7199–7214, Toronto, Canada. Association for Computational Linguistics.
- Cite (Informal):
- Local Byte Fusion for Neural Machine Translation (Sreedhar et al., ACL 2023)
- PDF:
- https://preview.aclanthology.org/build-pipeline-with-new-library/2023.acl-long.397.pdf