Zhengmao Ye


2026

Although the Universal Transformer (UT) mitigates the diminishing returns of standard LLM scaling by decoupling parameter count from depth, it remains constrained by linear computational costs and rigid weight-sharing mechanisms. These limitations lead to severe functional homogeneity, which in turn induces over-smoothing, representation rank collapse, and degraded reasoning performance. In this work, we present the first systematic study of Compute Distribution Skew, a pathological phenomenon in ultra-deep recurrent Transformers in which contributions are distributed disproportionately across recurrent steps, resulting in distinct functional states during the prefix and suffix processing phases. We identify it as the primary driver of extrapolation failure. To address this challenge, we propose the Polymorphic Transformer, which aims to achieve functional polymorphism and depth sparsity within a shared-parameter framework. By integrating conditional sparse subspaces, SiLU Attention, and an uncertainty-aware depth scheduler, our architecture mitigates power-method collapse and effectively decouples logical depth from computational cost. Experiments demonstrate that our model significantly enhances representation rank and robustness, achieving complex reasoning performance comparable to the baseline while reducing computation by 64.7%.

2025

Natural language transformation (NLT) tasks, such as machine translation (MT) and text style transfer (TST), require models to generate accurate and contextually appropriate outputs. However, existing approaches face significant challenges, including the computational cost of leveraging large pre-trained models and the limited generalization ability of fine-tuned smaller models. In this paper, we propose a novel framework that combines the flexibility of prompting with the cost-effectiveness of fine-tuning. Our method enhances smaller models by integrating retrieved In-Context Examples (ICE), enabling the model to better capture contextual information and align with user-level preferences. We further improve performance through hierarchical contrastive learning and dynamic preference inference mechanisms. Experimental results demonstrate that our approach outperforms existing methods, such as Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Contrastive Preference Optimization (CPO), on both MT and TST tasks, providing a more efficient solution for resource-constrained environments.