Statistically Optimized SGNS Model: Enhancing Word Vector Representation with Global Semantic Weight

Yulin Liu, Xiong Feng, Wanwei Liu, Wu Minghui


Abstract
Addressing the limitations of the Skip-gram with Negative Sampling (SGNS) model related to negative sampling, subsampling, and its fixed context window mechanism, this paper first presents an in-depth statistical analysis of the optimal solution for SGNS matrix factorization, deriving the theoretically optimal distribution for negative sampling. Building upon this analysis, we propose the concept of Global Semantic Weight (GSW), derived from Pointwise Mutual Information (PMI). We integrate GSW with word frequency information to improve the effectiveness of both negative sampling and subsampling. Furthermore, we design dynamic adjustment mechanisms for the context window size and the number of negative samples based on GSW, enabling the model to adaptively capture contextual information commensurate with the semantic importance of the center word. Notably, our optimized model maintains the same time complexity as the original SGNS implementation. Experimental results demonstrate that our proposed model achieves competitive performance against state-of-the-art word embedding models, including SGNS, CBOW, and GloVe, across multiple benchmark tasks. Compared with current mainstream dynamic word vector models, this work emphasizes a balance between efficiency and performance within a static embedding framework, and it offers a potential complement to and support for complex models such as LLMs.
Anthology ID:
2025.ccl-1.74
Volume:
Proceedings of the 24th China National Conference on Computational Linguistics (CCL 2025)
Month:
August
Year:
2025
Address:
Jinan, China
Editors:
Maosong Sun, Peiyong Duan, Zhiyuan Liu, Ruifeng Xu, Weiwei Sun
Venue:
CCL
Publisher:
Chinese Information Processing Society of China
Pages:
972–984
URL:
https://preview.aclanthology.org/ingest-ccl/2025.ccl-1.74/
Cite (ACL):
Yulin Liu, Xiong Feng, Wanwei Liu, and Wu Minghui. 2025. Statistically Optimized SGNS Model: Enhancing Word Vector Representation with Global Semantic Weight. In Proceedings of the 24th China National Conference on Computational Linguistics (CCL 2025), pages 972–984, Jinan, China. Chinese Information Processing Society of China.
Cite (Informal):
Statistically Optimized SGNS Model: Enhancing Word Vector Representation with Global Semantic Weight (Liu et al., CCL 2025)
PDF:
https://preview.aclanthology.org/ingest-ccl/2025.ccl-1.74.pdf