Andrew Zisserman
2026
Segment, Embed, and Align: A Universal Recipe for Aligning Subtitles to Signing
Zifan Jiang | Youngjoon Jang | Liliane Momeni | Gül Varol | Sarah Ebling | Andrew Zisserman
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The goal of this work is to develop a universal approach for aligning subtitles (i.e., spoken language text with corresponding timestamps) to continuous sign language videos. Prior approaches typically rely on end-to-end training tied to a specific language or dataset, which limits their generality. In contrast, our method, Segment, Embed, and Align (SEA), provides a single framework that works across multiple languages and domains. SEA leverages two pretrained models: the first segments a video sequence into individual signs, and the second embeds each sign video clip into a latent space shared with text. Alignment is then performed with a lightweight dynamic programming procedure that runs on a CPU within a minute, even for hour-long episodes. SEA is flexible and can adapt to a wide range of scenarios, utilizing resources ranging from small lexicons to large continuous corpora. Experiments on four sign language datasets demonstrate state-of-the-art alignment performance, highlighting the potential of SEA to generate high-quality parallel data for advancing sign language processing.
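To make the alignment step concrete, the following is a minimal sketch of one way a monotonic dynamic program over a subtitle-by-sign similarity matrix could look. The abstract does not specify the actual procedure, so everything here is an assumption for illustration: the function names `align_monotonic` and `spans_from_labels`, the objective (maximize summed similarity with every sign assigned to exactly one subtitle, in order), and the toy similarity values are hypothetical and are not taken from SEA.

```python
def align_monotonic(sim):
    """Assign each sign segment to one subtitle, preserving temporal order.

    sim[i][j] is an assumed similarity score between subtitle i and sign
    segment j (e.g. from a shared text/sign embedding space).
    Returns labels, where labels[j] is the subtitle index for sign j and
    labels is non-decreasing in j. Illustrative objective only.
    """
    n_subs, n_signs = len(sim), len(sim[0])
    neg_inf = float("-inf")
    # best[j][i]: best total score for signs 0..j with sign j given to subtitle i
    best = [[neg_inf] * n_subs for _ in range(n_signs)]
    back = [[0] * n_subs for _ in range(n_signs)]
    for i in range(n_subs):
        best[0][i] = sim[i][0]
    for j in range(1, n_signs):
        run_max, run_arg = neg_inf, 0
        for i in range(n_subs):
            # Running maximum over predecessor subtitles i' <= i keeps the
            # recurrence at O(n_subs * n_signs) overall.
            if best[j - 1][i] > run_max:
                run_max, run_arg = best[j - 1][i], i
            best[j][i] = run_max + sim[i][j]
            back[j][i] = run_arg
    # Backtrack from the best final state.
    i = max(range(n_subs), key=lambda k: best[-1][k])
    labels = [0] * n_signs
    for j in range(n_signs - 1, -1, -1):
        labels[j] = i
        i = back[j][i]
    return labels


def spans_from_labels(labels, segments):
    """Turn per-sign labels into one (start, end) time span per subtitle.

    segments[j] is the (start, end) time of sign segment j, in seconds.
    Subtitles that received no signs are absent from the result.
    """
    spans = {}
    for label, (start, end) in zip(labels, segments):
        if label in spans:
            spans[label] = (min(spans[label][0], start), max(spans[label][1], end))
        else:
            spans[label] = (start, end)
    return spans


# Toy example: 2 subtitles, 4 sign segments (all values made up).
sim = [[0.9, 0.8, 0.1, 0.2],
       [0.1, 0.2, 0.7, 0.9]]
segments = [(0.0, 1.0), (1.0, 2.0), (3.0, 4.0), (4.0, 5.0)]
labels = align_monotonic(sim)
spans = spans_from_labels(labels, segments)
```

The table has one cell per (subtitle, sign) pair and each cell is filled in constant time, which is consistent with the claim that such a procedure can run quickly on a CPU. A practical variant would likely also allow unassigned signs and use the original subtitle timestamps as a prior, neither of which is modeled in this sketch.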