Segment, Embed, and Align: A Universal Recipe for Aligning Subtitles to Signing

Zifan Jiang, Youngjoon Jang, Liliane Momeni, G\"ul Varol, Sarah Ebling, Andrew Zisserman


Abstract
The goal of this work is to develop a universal approach for aligning subtitles (i.e., spoken language text with corresponding timestamps) to continuous sign language videos. Prior approaches typically rely on end-to-end training tied to a specific language or dataset, which limits their generality. In contrast, our method Segment, Embed, and Align (SEA) provides a single framework that works across multiple languages and domains. SEA leverages two pretrained models: the first to segment a video sequence into individual signs and the second to embed each sign video clip into a shared latent space with text. Alignment is subsequently performed with a lightweight dynamic programming procedure that runs efficiently on CPU within a minute, even for hour-long episodes. SEA is flexible and can adapt to a wide range of scenarios, utilizing resources from small lexicons to large continuous corpora. Experiments on four sign language datasets demonstrate state-of-the-art alignment performance, highlighting the potential of SEA to generate high-quality parallel data for advancing sign language processing.
Anthology ID:
2026.acl-long.1401
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
30371–30384
Language:
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1401/
DOI:
Bibkey:
Cite (ACL):
Zifan Jiang, Youngjoon Jang, Liliane Momeni, G\"ul Varol, Sarah Ebling, and Andrew Zisserman. 2026. Segment, Embed, and Align: A Universal Recipe for Aligning Subtitles to Signing. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 30371–30384, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Segment, Embed, and Align: A Universal Recipe for Aligning Subtitles to Signing (Jiang et al., ACL 2026)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1401.pdf
Checklist:
 2026.acl-long.1401.checklist.pdf