Hybrid Autoregressive-Diffusion Model for Real-Time Sign Language Production

Maoxiao Ye; Xinfeng Ye; Sathiamoorthy Manoharan

Abstract

Earlier Sign Language Production (SLP) models typically relied on autoregressive methods that generate output tokens one at a time, which inherently provide temporal alignment. Although techniques such as Teacher Forcing can prevent model collapse during training, they cannot solve the problem of error accumulation during inference, since ground truth is unavailable at that stage. In contrast, more recent approaches based on diffusion models use step-by-step denoising to enable high-quality generation. However, the iterative nature of these models and the requirement to denoise entire sequences limit their applicability to real-time tasks such as SLP. To address these limitations, we propose a hybrid autoregressive-diffusion model for SLP that combines sequential dependency modeling with iterative refinement. A Multi-Scale Pose Representation module captures fine-grained articulator features, while a Confidence-Aware Causal Attention mechanism guides generation using joint-level confidence scores. Experiments on PHOENIX14T and How2Sign show improved generation quality and real-time efficiency.

Anthology ID: 2026.acl-long.31
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 750–763
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.31/
Cite (ACL): Maoxiao Ye, Xinfeng Ye, and Sathiamoorthy Manoharan. 2026. Hybrid Autoregressive-Diffusion Model for Real-Time Sign Language Production. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 750–763, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Hybrid Autoregressive-Diffusion Model for Real-Time Sign Language Production (Ye et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.31.pdf
Checklist: 2026.acl-long.31.checklist.pdf
