Languages Through the Looking Glass of BPE Compression

Ximena Gutierrez-Vasques, Christian Bentz, Tanja Samardžić


Abstract
Byte-pair encoding (BPE) is widely used in NLP for performing subword tokenization. It uncovers redundant patterns for compressing the data, and hence alleviates the sparsity problem in downstream applications. Subwords discovered during the first merge operations tend to have the most substantial impact on the compression of texts. However, the structural underpinnings of this effect have not been analyzed cross-linguistically. We conduct in-depth analyses across 47 typologically diverse languages and three parallel corpora, and thereby show that the types of recurrent patterns that have the strongest impact on compression are an indicator of morphological typology. For languages with richer inflectional morphology there is a preference for highly productive subwords on the early merges, while for languages with less inflectional morphology, idiosyncratic subwords are more prominent. Both types of patterns contribute to efficient compression. Counter to the common perception that BPE subwords are not linguistically relevant, we find patterns across languages that resemble those described in traditional typology. We thus propose a novel way to characterize languages according to their BPE subword properties, inspired by the notion of morphological productivity in linguistics. This allows us to have language vectors that encode typological knowledge induced from raw text. Our approach is easily applicable to a wider range of languages and texts, as it does not require annotated data or any external linguistic knowledge. We discuss its potential contributions to quantitative typology and multilingual NLP.
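To make the notion of "merge operations" concrete, here is a minimal sketch of BPE merge learning that also records the corpus length (in symbols) after each merge, so the compression gained by early versus late merges can be compared. It is an illustration under stated assumptions, not the authors' implementation: the function name `learn_bpe`, the toy corpus, the restriction to word-internal merges, and the tie-breaking rule are all hypothetical choices made for this example.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of word tokens.

    Returns (merges, lengths), where lengths[i] is the total number of
    symbols in the corpus after applying merge i.
    """
    # Each word type starts as a tuple of characters, weighted by frequency.
    vocab = Counter(tuple(w) for w in words)
    merges, lengths = [], []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties broken lexically (an assumption, for determinism).
        best = max(pairs, key=lambda p: (pairs[p], p))
        # Replace each occurrence of the best pair with one merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
        merges.append(best)
        lengths.append(sum(len(s) * f for s, f in vocab.items()))
    return merges, lengths


if __name__ == "__main__":
    # Hypothetical toy corpus, for illustration only.
    corpus = "low low lower lowest newer newest wider widest".split()
    for merge, length in zip(*learn_bpe(corpus, 6)):
        print(merge, length)
```

Plotting `lengths` against the merge index for a real corpus would give a rough view of how much each merge shortens the text, which is the effect the abstract attributes mostly to the first merges.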
Anthology ID: 2023.cl-4.5
Volume: Computational Linguistics, Volume 49, Issue 4 - December 2023
Month: December
Year: 2023
Address: Cambridge, MA
Venue: CL
Publisher: MIT Press
Pages: 943–1001
URL: https://aclanthology.org/2023.cl-4.5
DOI: 10.1162/coli_a_00489
Cite (ACL): Ximena Gutierrez-Vasques, Christian Bentz, and Tanja Samardžić. 2023. Languages Through the Looking Glass of BPE Compression. Computational Linguistics, 49(4):943–1001.
Cite (Informal): Languages Through the Looking Glass of BPE Compression (Gutierrez-Vasques et al., CL 2023)
PDF: https://preview.aclanthology.org/revert-3132-ingestion-checklist/2023.cl-4.5.pdf