Understanding Subword Compositionality of Large Language Models

Qiwei Peng, Yekun Chai, Anders Søgaard


Abstract
Large language models (LLMs) take sequences of subwords as input, requiring them to effectively compose subword representations into meaningful word-level representations. In this paper, we present a comprehensive set of experiments to probe how LLMs compose subword information, focusing on three key aspects: structural similarity, semantic decomposability, and form retention. Our analysis of these experiments suggests that the five LLM families we study can be classified into three distinct groups, likely reflecting differences in their underlying composition strategies. Specifically, we observe (i) three distinct patterns in the evolution of structural similarity between subword compositions and whole-word representations across layers; (ii) strong performance when probing, layer by layer, their sensitivity to semantic decomposability; and (iii) three distinct patterns when probing sensitivity to formal features, e.g., character sequence length. These findings provide valuable insights into the compositional dynamics of LLMs and highlight different patterns in how LLMs encode and integrate subword information.
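
For readers unfamiliar with this style of probing, the minimal sketch below illustrates the kind of layer-wise comparison the abstract describes: composing a word's subword hidden states and measuring how close the composition is to a word-level representation at every layer. This is not the authors' code; the model choice (gpt2), the mean-pooling composition, and the use of the last subword's state as a word-level proxy are all illustrative assumptions.

```python
# Hypothetical layer-wise probe of subword compositionality (not from the paper).
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "gpt2"  # illustrative choice; the paper studies several LLM families
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

word = "unbelievable"  # typically split into several BPE subwords
inputs = tokenizer(word, return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).hidden_states  # tuple: (layers+1) x (1, seq, dim)

for layer, states in enumerate(hidden_states):
    subword_states = states[0]              # (num_subwords, dim)
    composed = subword_states.mean(dim=0)   # mean-pooled subword composition (assumption)
    word_proxy = subword_states[-1]         # last-subword state as word-level proxy (assumption)
    sim = F.cosine_similarity(composed, word_proxy, dim=0).item()
    print(f"layer {layer:2d}: cosine(composition, word proxy) = {sim:.3f}")
```

Tracking such similarity curves across layers, and across model families, is one way to surface the distinct compositional patterns the abstract refers to.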
Anthology ID:
2025.emnlp-main.1146
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
22535–22546
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1146/
Cite (ACL):
Qiwei Peng, Yekun Chai, and Anders Søgaard. 2025. Understanding Subword Compositionality of Large Language Models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 22535–22546, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Understanding Subword Compositionality of Large Language Models (Peng et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1146.pdf
Checklist:
2025.emnlp-main.1146.checklist.pdf