ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models
Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel
Abstract
Most widely used pre-trained language models operate on sequences of tokens corresponding to word or subword units. By comparison, token-free models that operate directly on raw text (bytes or characters) have many benefits: they can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by removing complex and error-prone text preprocessing pipelines. Because byte or character sequences are longer than token sequences, past work on token-free models has often introduced new model architectures designed to amortize the cost of operating directly on raw text. In this paper, we show that a standard Transformer architecture can be used with minimal modifications to process byte sequences. We characterize the trade-offs in terms of parameter count, training FLOPs, and inference speed, and show that byte-level models are competitive with their token-level counterparts. We also demonstrate that byte-level models are significantly more robust to noise and perform better on tasks that are sensitive to spelling and pronunciation. As part of our contribution, we release a new set of pre-trained byte-level Transformer models based on the T5 architecture, as well as all code and data used in our experiments.
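The abstract's core point is that "token-free" here simply means replacing a learned subword vocabulary with the 256 possible byte values of UTF-8 text, so any string in any language can be fed to the model without a tokenizer. The sketch below (not the authors' released code) illustrates this mapping from text to byte-level IDs and back; the assumption that IDs 0-2 are reserved for special tokens and each byte value is shifted by 3 is made here for illustration.

```python
# Minimal sketch of byte-level encoding for a ByT5-style model.
# Assumption (not taken from the paper text): IDs 0-2 are reserved for
# pad/eos/unk, and each UTF-8 byte value b maps to ID b + 3.

SPECIAL_OFFSET = 3  # assumed reserved IDs: 0 = pad, 1 = eos, 2 = unk
EOS_ID = 1

def text_to_ids(text: str) -> list[int]:
    """Encode text as UTF-8 bytes, shift past the special tokens, append EOS."""
    return [b + SPECIAL_OFFSET for b in text.encode("utf-8")] + [EOS_ID]

def ids_to_text(ids: list[int]) -> str:
    """Drop special-token IDs and decode the remaining bytes back to text."""
    raw = bytes(i - SPECIAL_OFFSET for i in ids if i >= SPECIAL_OFFSET)
    return raw.decode("utf-8", errors="ignore")

if __name__ == "__main__":
    ids = text_to_ids("ByT5 is token-free: héllo, 世界")
    print(ids)               # one ID per UTF-8 byte, so non-ASCII text yields longer sequences
    print(ids_to_text(ids))  # round-trips back to the original string
```

Because every character becomes one to four IDs rather than a fraction of a subword token, sequences are longer, which is exactly the cost the paper characterizes in terms of parameters, training FLOPs, and inference speed.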
- Anthology ID: 2022.tacl-1.17
- Volume: Transactions of the Association for Computational Linguistics, Volume 10
- Year: 2022
- Address: Cambridge, MA
- Venue: TACL
- Publisher: MIT Press
- Pages: 291–306
- URL: https://aclanthology.org/2022.tacl-1.17
- DOI: 10.1162/tacl_a_00461
- Cite (ACL): Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel. 2022. ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models. Transactions of the Association for Computational Linguistics, 10:291–306.
- Cite (Informal): ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models (Xue et al., TACL 2022)
- PDF: https://preview.aclanthology.org/paclic-22-ingestion/2022.tacl-1.17.pdf