Yonghe Lu

Also published as: 永和


2026

Short text clustering has gained significant prominence due to its ubiquity in real-world applications. Despite the recent success of contrastive clustering, existing paradigms still suffer from two critical bottlenecks: (1) conventional data augmentation provides limited semantic granularity and may introduce unintended noise; and (2) the absence of global optimization for cluster assignments often precipitates the accumulation of pseudo-label noise, thereby compromising semantic consistency. To bridge these gaps, we propose MAST, a Multi-view Alignment Strategy with Transport-based clustering. MAST constructs complementary structural views to capture multi-granularity semantic features and introduces a multi-view contrastive objective that jointly aligns original, augmented, and structure-enhanced embeddings. To mitigate representation over-smoothing, we incorporate structure-aware negative reweighting and intermediate-layer negative sampling. Furthermore, MAST employs high-confidence guided refinement and an optimal transport-based pseudo-label alignment mechanism to enforce global semantic consistency across multiple views. Extensive experiments on several benchmark datasets demonstrate that MAST consistently outperforms state-of-the-art methods, establishing a new competitive baseline for short text clustering.
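The transport-based pseudo-label step can be illustrated with a generic Sinkhorn-Knopp normalization. This is a minimal sketch, not MAST's actual formulation, which the abstract does not specify. It assumes a uniform prior over clusters, and the function name `sinkhorn_pseudo_labels` and the parameters `epsilon` and `n_iters` are hypothetical.

```python
import math

def sinkhorn_pseudo_labels(logits, epsilon=0.5, n_iters=50):
    """Balanced soft assignments from an (n_samples x n_clusters) score matrix.

    Assumption: every cluster should receive the same total mass (uniform prior).
    """
    n = len(logits)
    k = len(logits[0])
    # Subtract the global maximum for numerical stability before exponentiating.
    m = max(max(row) for row in logits)
    q = [[math.exp((v - m) / epsilon) for v in row] for row in logits]
    for _ in range(n_iters):
        # Column step: rescale so every cluster holds total mass 1/k.
        col_sums = [sum(q[i][j] for i in range(n)) for j in range(k)]
        q = [[q[i][j] / (col_sums[j] * k) for j in range(k)] for i in range(n)]
        # Row step: rescale so every sample holds total mass 1/n.
        row_sums = [sum(row) for row in q]
        q = [[v / (row_sums[i] * n) for v in row] for i, row in enumerate(q)]
    # Multiply by n so each row becomes a probability distribution over clusters.
    return [[v * n for v in row] for row in q]

# Toy scores in which every sample prefers cluster 0 before balancing.
scores = [[2.0, 1.0], [1.8, 0.2], [1.5, 1.4], [0.3, 0.1]]
soft_labels = sinkhorn_pseudo_labels(scores)
```

Because the assignment is solved jointly over the whole batch, a cluster that every sample prefers locally cannot absorb all of the mass. This is one way a global constraint can limit the accumulation of pseudo-label noise.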

2025

Short texts pose significant challenges for clustering due to semantic sparsity, limited context, and fuzzy category boundaries. Although recent contrastive learning methods improve instance-level representation, they often overlook local semantic structure within the clustering head. Moreover, treating semantically similar neighbors as negatives impairs cluster-level discrimination. To address these issues, we propose the Fuzzy Neighborhood-Aware Self-Supervised Contrastive Clustering (FNSCC) framework. FNSCC incorporates neighborhood information at both the instance level and the cluster level. At the instance level, it excludes neighbors from the negative sample set to enhance inter-cluster separability. At the cluster level, it introduces fuzzy neighborhood-aware weighting to refine soft assignment probabilities, encouraging alignment with semantically coherent clusters. Experiments on multiple benchmark short text datasets demonstrate that FNSCC consistently outperforms state-of-the-art models in accuracy and normalized mutual information. Our code is available at https://github.com/zjzone/FNSCC.
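The instance-level idea of dropping neighbors from the negative set can be sketched with a standard InfoNCE-style loss. This is an illustrative sketch under assumptions, not the FNSCC implementation (see the linked repository for that). The function names, the `temperature` value, and the toy embeddings are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def neighbor_aware_info_nce(anchor_idx, positive, embeddings, neighbors, temperature=0.5):
    """InfoNCE loss for one anchor, with its neighbors removed from the negatives."""
    anchor = embeddings[anchor_idx]
    pos_term = math.exp(cosine(anchor, positive) / temperature)
    # The anchor itself and its semantic neighbors are never used as negatives.
    excluded = set(neighbors) | {anchor_idx}
    neg_terms = [
        math.exp(cosine(anchor, emb) / temperature)
        for j, emb in enumerate(embeddings)
        if j not in excluded
    ]
    return -math.log(pos_term / (pos_term + sum(neg_terms)))

# Toy batch: sample 1 is a close neighbor of the anchor (sample 0).
batch = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
augmented_anchor = [0.95, 0.05]
loss_plain = neighbor_aware_info_nce(0, augmented_anchor, batch, neighbors=[])
loss_filtered = neighbor_aware_info_nce(0, augmented_anchor, batch, neighbors=[1])
```

With the neighbor excluded, the anchor is no longer pushed away from a sample that likely belongs to the same cluster, so `loss_filtered` is smaller than `loss_plain` on this toy batch.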

2023

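Patent similarity calculation and comparison have long been carried out manually by patent examiners, who are relied on to make accurate judgments. However, manually analyzing and assessing a patent's originality, utility, and potential infringement requires substantial human and material resources and is inefficient. To address this, this paper uses the ALBERT pre-trained model for patent text representation and introduces the Synonyms thesaurus to strengthen the semantic expressiveness of patent text, exploring a patent text representation model and similarity calculation method based on a semantic knowledge base and deep learning. Experimental results show that disambiguation with the Synonyms thesaurus yields a modest improvement in the accuracy of patent text similarity measurement.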