On the Correspondence between the Squared Norm and Information Content in Text Embeddings
Enrique Amigo, Adrian Ghajari, Alejandro Benito-Santos, Diego De La Fuente Rodríguez
Abstract
Previous work has reported both empirical and theoretical evidence, for specific training models, of the correspondence between the squared norm of an embedding and the information content of the text it represents. In this paper, we investigate the relationship at the theoretical and empirical levels, focusing on the mechanisms and composition functions used to combine token embeddings. i) We formally derive two sufficient theoretical conditions for this correspondence to hold in embedding models. ii) We empirically examine the correspondence and the validity of these conditions at the word level for both static and contextual embeddings and different subword token composition mechanisms. iii) Building on Shannon’s Constant Entropy Rate (CER) principle, we explore whether embedding mechanisms exhibit a linearly monotonic increase in information content as text length increases. Our formal analysis and experiments reveal that: i) At the word embedding level, models satisfy the sufficient conditions and show a strong correspondence when certain subword composition functions are applied. ii) Only scaled embedding averages proposed in this paper and certain information-theoretic composition functions preserve the correspondence. Some non-compositional representations, such as the CLS token in BERT or the EOS token in LLaMA, tend to converge toward a fixed point. The CLS token in ModernBERT, however, exhibits behavior that aligns more closely with the CER hypothesis.
- Anthology ID:
- 2025.findings-emnlp.734
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2025
- Month:
- November
- Year:
- 2025
- Address:
- Suzhou, China
- Editors:
- Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 13631–13643
- URL:
- https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.734/
- DOI:
- 10.18653/v1/2025.findings-emnlp.734
- Cite (ACL):
- Enrique Amigo, Adrian Ghajari, Alejandro Benito-Santos, and Diego De La Fuente Rodríguez. 2025. On the Correspondence between the Squared Norm and Information Content in Text Embeddings. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 13631–13643, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal):
- On the Correspondence between the Squared Norm and Information Content in Text Embeddings (Amigo et al., Findings 2025)
- PDF:
- https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.734.pdf