Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

Jonathan H. Clark; Dan Garrette; Iulia Turc; John Wieting

doi:10.1162/tacl_a_00448

Abstract

Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model’s ability to adapt. In this paper, we present Canine, a neural encoder that operates directly on character sequences—without explicit tokenization or vocabulary—and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficiently, Canine combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. Canine outperforms a comparable mBert model by 5.7 F1 on TyDi QA, a challenging multilingual benchmark, despite having fewer model parameters.
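The two ideas the abstract names, vocabulary-free character input and downsampling to shorten the sequence, can be illustrated with a toy sketch. This is not the authors' implementation: the helper names `to_codepoints` and `downsample`, the use of mean pooling, and the rate of 4 are all assumptions made for illustration, and a real encoder would pool learned embedding vectors rather than raw code point values.

```python
from typing import List


def to_codepoints(text: str) -> List[int]:
    """Map each character to its Unicode code point; no vocabulary lookup is needed."""
    return [ord(ch) for ch in text]


def downsample(values: List[float], rate: int) -> List[float]:
    """Shorten a sequence by averaging each consecutive window of `rate` items.

    Mean pooling is a stand-in chosen for this sketch (an assumption),
    not a method stated in the abstract.
    """
    if rate < 1:
        raise ValueError("rate must be >= 1")
    pooled = []
    for start in range(0, len(values), rate):
        window = values[start:start + rate]  # the last window may be shorter
        pooled.append(sum(window) / len(window))
    return pooled


# Hypothetical input; scalars stand in for per-character embedding vectors.
codepoints = to_codepoints("tokenization-free")
shorter = downsample([float(c) for c in codepoints], rate=4)
print(len(codepoints), len(shorter))
```

The point of the sketch is only the shape of the computation: the input length equals the number of characters, and the downsampled sequence is roughly `rate` times shorter, which is what makes a deep stack over it affordable.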

Anthology ID: 2022.tacl-1.5
Volume: Transactions of the Association for Computational Linguistics, Volume 10
Year: 2022
Address: Cambridge, MA
Editors: Brian Roark, Ani Nenkova
Venue: TACL
Publisher: MIT Press
Pages: 73–91
URL: https://preview.aclanthology.org/nschneid-patch-2/2022.tacl-1.5/
DOI: 10.1162/tacl_a_00448
Cite (ACL): Jonathan H. Clark, Dan Garrette, Iulia Turc, and John Wieting. 2022. Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation. Transactions of the Association for Computational Linguistics, 10:73–91.
Cite (Informal): Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation (Clark et al., TACL 2022)
PDF: https://preview.aclanthology.org/nschneid-patch-2/2022.tacl-1.5.pdf
Video: https://preview.aclanthology.org/nschneid-patch-2/2022.tacl-1.5.mp4