What do tokens know about their characters and how do they know it?

Ayush Kaushal, Kyle Mahowald


Abstract
Pre-trained language models (PLMs) that use subword tokenization schemes can succeed at a variety of language tasks that require character-level information, despite lacking explicit access to the character composition of tokens. Here, studying a range of models (e.g., GPT-J, BERT, RoBERTa, GloVe), we probe what word pieces encode about character-level information by training classifiers to predict the presence or absence of a particular alphabetical character in a token, based on its embedding (e.g., probing whether the model embedding for “cat” encodes that it contains the character “a”). We find that these models robustly encode character-level information and, in general, larger models perform better at the task. We show that these results generalize to characters from non-Latin alphabets (Arabic, Devanagari, and Cyrillic). Then, through a series of experiments and analyses, we investigate the mechanisms through which PLMs acquire English-language character information during training and argue that this knowledge is acquired through multiple phenomena, including a systematic relationship between particular characters and particular parts of speech, as well as natural variability in the tokenization of related strings.
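The probing setup described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the embeddings are synthetic (random vectors with an artificial signal added when the token contains the target character), the word list is arbitrary, and the probe is a plain logistic regression, which may differ from the classifier used in the paper. A real experiment would replace `toy_embeddings` with the actual embedding matrix of a model such as GPT-J or BERT.

```python
import math
import random


def has_char(token, char):
    """Label for the probe: 1 if `char` occurs in `token`, else 0."""
    return 1 if char in token.lower() else 0


def toy_embeddings(tokens, char, dim=8, signal=2.0, seed=0):
    """Hypothetical stand-in for a model's embedding table.

    Each token gets a random Gaussian vector; tokens containing `char`
    are shifted along one fixed direction so the probe has something to find.
    """
    rng = random.Random(seed)
    direction = [rng.gauss(0, 1) for _ in range(dim)]
    table = {}
    for tok in tokens:
        vec = [rng.gauss(0, 1) for _ in range(dim)]
        if has_char(tok, char):
            vec = [v + signal * d for v, d in zip(vec, direction)]
        table[tok] = vec
    return table


def train_probe(X, y, epochs=200, lr=0.1):
    """Fit a logistic-regression probe with per-example gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - label  # gradient of the log loss with respect to z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def predict(w, b, x):
    """Return 1 if the probe predicts the character is present."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


def run_demo(char="a", seed=0):
    """Train on most of a small vocabulary, report held-out accuracy."""
    vocab = ["cat", "dog", "tree", "apple", "bird", "stone", "water", "fish",
             "lamp", "book", "chair", "river", "moon", "star", "glass", "wind",
             "bread", "street", "night", "plant", "house", "cloud", "green", "salt"]
    table = toy_embeddings(vocab, char, seed=seed)
    rng = random.Random(seed)
    rng.shuffle(vocab)
    split = int(0.75 * len(vocab))
    train, test = vocab[:split], vocab[split:]
    w, b = train_probe([table[t] for t in train], [has_char(t, char) for t in train])
    correct = sum(predict(w, b, table[t]) == has_char(t, char) for t in test)
    return correct / len(test)


if __name__ == "__main__":
    print("held-out accuracy:", run_demo())
```

Evaluating on held-out tokens matters in this kind of setup: accuracy on the training tokens alone would not distinguish information carried by the embeddings from memorization by the probe.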
Anthology ID:
2022.naacl-main.179
Volume:
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Month:
July
Year:
2022
Address:
Seattle, United States
Editors:
Marine Carpuat, Marie-Catherine de Marneffe, Ivan Vladimir Meza Ruiz
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
2487–2507
URL:
https://aclanthology.org/2022.naacl-main.179
DOI:
10.18653/v1/2022.naacl-main.179
Cite (ACL):
Ayush Kaushal and Kyle Mahowald. 2022. What do tokens know about their characters and how do they know it? In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2487–2507, Seattle, United States. Association for Computational Linguistics.
Cite (Informal):
What do tokens know about their characters and how do they know it? (Kaushal & Mahowald, NAACL 2022)
PDF:
https://preview.aclanthology.org/emnlp-22-attachments/2022.naacl-main.179.pdf
Video:
https://preview.aclanthology.org/emnlp-22-attachments/2022.naacl-main.179.mp4
Code:
ayushk4/character-probing-pytorch
Data:
The Pile