MSCode: Advancing Human Motion-Language Understanding via Modality-Shared Codebook

Haoyu Shi, Huaiwen Zhang


Abstract
Human motion understanding has recently become a prominent area of research due to its critical importance in many fields. The key to advancing this understanding lies in the precise alignment between motion and linguistic modalities. Existing methods mainly follow two paradigms: global contrastive alignment and vocabulary-space-based alignment. However, motion sequences exhibit sequential spatiotemporal dynamics while text conveys abstract semantics, leading to a fundamental mismatch in semantic level and granularity. This mismatch undermines cross-modal alignment and results in suboptimal downstream performance. To alleviate this, we introduce a modality-shared codebook that enables unified representation learning and precise alignment of motion and linguistic modalities. Each codeword in the codebook is regularized to encode cross-modality shared semantics, and we leverage sparse activation and a distribution consistency loss to enforce that matched motion and text are represented by the same set of codewords. Additionally, we introduce a locality-aware Gaussian encoder to refine pose features and design a hard-negative-guided loss to strengthen alignment discriminability. Extensive experiments across various language-motion evaluation tasks, including text-motion retrieval, text-motion grounding, and motion captioning, demonstrate that our model significantly surpasses current state-of-the-art methods.
Anthology ID:
2026.findings-acl.1901
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
38116–38132
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.findings-acl.1901/
Cite (ACL):
Haoyu Shi and Huaiwen Zhang. 2026. MSCode: Advancing Human Motion-Language Understanding via Modality-Shared Codebook. In Findings of the Association for Computational Linguistics: ACL 2026, pages 38116–38132, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
MSCode: Advancing Human Motion-Language Understanding via Modality-Shared Codebook (Shi & Zhang, Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.findings-acl.1901.pdf
Checklist:
2026.findings-acl.1901.checklist.pdf