A Vocabulary-Free Multilingual Neural Tokenizer for End-to-End Task Learning

Md Mofijul Islam; Gustavo Aguilar; Pragaash Ponnusamy; Clint Solomon Mathialagan; Chengyuan Ma; Chenlei Guo

doi:10.18653/v1/2022.repl4nlp-1.10

Abstract

Subword tokenization is a commonly used input pre-processing step in most recent NLP models. However, it limits the models’ ability to leverage end-to-end task learning. Its frequency-based vocabulary creation compromises tokenization in low-resource languages, leading models to produce suboptimal representations. Additionally, the dependency on a fixed vocabulary limits the subword models’ adaptability across languages and domains. In this work, we propose a vocabulary-free neural tokenizer by distilling segmentation information from heuristic-based subword tokenization. We pre-train our character-based tokenizer by processing unique words from a multilingual corpus, thereby extensively increasing word diversity across languages. Unlike the predefined and fixed vocabularies in subword methods, our tokenizer allows end-to-end task learning, resulting in optimal task-specific tokenization. The experimental results show that replacing the subword tokenizer with our neural tokenizer consistently improves performance on multilingual (NLI) and code-switching (sentiment analysis) tasks, with larger gains in low-resource languages. Additionally, our neural tokenizer exhibits robust performance on downstream tasks when adversarial noise is present (typos and misspellings), further increasing the initial improvements over statistical subword tokenizers.
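
To make the idea of distilling segmentation from a subword tokenizer into a character-based model more concrete, the sketch below shows one plausible way to turn a subword split of a word into per-character training targets. The B/I labelling scheme, the function names `boundary_labels` and `segment`, and the example split are assumptions made for illustration only; the abstract does not specify the label format or the training objective, so this should not be read as the authors' implementation.

```python
from typing import List


def boundary_labels(pieces: List[str]) -> List[str]:
    """Map a subword segmentation of one word to per-character labels.

    Assumed scheme (not taken from the paper): "B" marks the first
    character of a piece, "I" marks every following character.
    """
    labels: List[str] = []
    for piece in pieces:
        if not piece:
            continue  # skip empty pieces defensively
        labels.append("B")
        labels.extend("I" * (len(piece) - 1))
    return labels


def segment(word: str, labels: List[str]) -> List[str]:
    """Rebuild the pieces of a word from its characters and B/I labels."""
    if len(word) != len(labels):
        raise ValueError("word and labels must have the same length")
    pieces: List[str] = []
    for char, label in zip(word, labels):
        if label == "B" or not pieces:
            pieces.append(char)  # start a new piece
        else:
            pieces[-1] += char   # extend the current piece
    return pieces


# Hypothetical subword split, standing in for the output of a heuristic
# tokenizer such as BPE or WordPiece.
teacher_pieces = ["token", "izer"]
targets = boundary_labels(teacher_pieces)
restored = segment("".join(teacher_pieces), targets)
```

Under these assumptions, a character-level model trained to predict `targets` for each unique word would reproduce the teacher's segmentation without storing a vocabulary, and its predictions could later be adjusted by the downstream task loss.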

Anthology ID: 2022.repl4nlp-1.10
Volume: Proceedings of the 7th Workshop on Representation Learning for NLP
Month: May
Year: 2022
Address: Dublin, Ireland
Venue: RepL4NLP
Publisher: Association for Computational Linguistics
Pages: 91–99
URL: https://aclanthology.org/2022.repl4nlp-1.10
DOI: 10.18653/v1/2022.repl4nlp-1.10
Cite (ACL): Md Mofijul Islam, Gustavo Aguilar, Pragaash Ponnusamy, Clint Solomon Mathialagan, Chengyuan Ma, and Chenlei Guo. 2022. A Vocabulary-Free Multilingual Neural Tokenizer for End-to-End Task Learning. In Proceedings of the 7th Workshop on Representation Learning for NLP, pages 91–99, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal): A Vocabulary-Free Multilingual Neural Tokenizer for End-to-End Task Learning (Mofijul Islam et al., RepL4NLP 2022)
PDF: https://preview.aclanthology.org/ingestion-script-update/2022.repl4nlp-1.10.pdf
Video: https://preview.aclanthology.org/ingestion-script-update/2022.repl4nlp-1.10.mp4
