Haoyu Shi


2026

Human motion understanding has become a prominent research area because of its importance in many fields. The key to advancing this understanding lies in the precise alignment between the motion and linguistic modalities. Existing methods mainly follow two paradigms: global contrastive alignment and vocabulary space-based alignment. However, motion sequences exhibit sequential spatiotemporal dynamics while text conveys abstract semantics, leading to a fundamental mismatch in semantic levels and granularities. This mismatch undermines cross-modal alignment and results in suboptimal downstream performance. To alleviate it, we introduce a modality-shared codebook that enables unified representation learning and precise alignment of the motion and linguistic modalities. Each codeword in the codebook is regularized to encode cross-modality shared semantics, and we leverage sparse activation and a distribution consistency loss to ensure that matched motion and text are represented by the same set of codewords. Additionally, we introduce a locality-aware Gaussian encoder to refine pose features and design a hard-negative guided loss to strengthen alignment discriminability. Extensive experiments across various language-motion evaluation tasks, including text-motion retrieval, text-motion grounding, and motion captioning, demonstrate that our model significantly surpasses current state-of-the-art methods.
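
To make the shared-codebook idea concrete, the following is a minimal illustrative sketch and not the implementation used in this work. It assumes cosine similarity between features and codewords, max pooling over the sequence, top-k selection as the sparse activation, and a symmetric KL divergence as the distribution consistency term. The names `sparse_codeword_distribution`, `consistency_loss`, and `top_k` are hypothetical and chosen only for illustration.

```python
import numpy as np

def sparse_codeword_distribution(features, codebook, top_k=4):
    """Map a (T, D) feature sequence to a sparse distribution over K codewords.

    Assumed design (not from the paper): cosine similarity, max pooling
    over time, then a softmax restricted to the top_k codewords.
    """
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    c = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)
    sim = f @ c.T                      # (T, K) cosine similarities
    pooled = sim.max(axis=0)           # (K,) strongest response per codeword
    keep = np.argsort(pooled)[-top_k:] # indices of the top_k codewords
    masked = np.full_like(pooled, -np.inf)
    masked[keep] = pooled[keep]        # all other codewords get zero mass
    exp = np.exp(masked - masked.max())
    return exp / exp.sum()

def consistency_loss(p, q, eps=1e-8):
    """Symmetric KL divergence between two codeword distributions (assumed form)."""
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    kl_pq = np.sum(p * np.log(p / q))
    kl_qp = np.sum(q * np.log(q / p))
    return 0.5 * (kl_pq + kl_qp)

# Toy usage with random stand-ins for encoder outputs.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 8))    # K=16 shared codewords, D=8
motion_feats = rng.normal(size=(20, 8))  # 20 motion frames
text_feats = rng.normal(size=(5, 8))     # 5 text tokens
p_motion = sparse_codeword_distribution(motion_feats, codebook)
p_text = sparse_codeword_distribution(text_feats, codebook)
loss = consistency_loss(p_motion, p_text)
```

In this sketch both modalities are projected onto the same codebook, so minimizing `loss` for matched pairs pushes them toward activating the same small set of codewords.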