Anal Roy Chowdhury
2026
Can Small Vision–Language Models Perform Sign Language Translation?
Anal Roy Chowdhury | Debarshi Kumar Sanyal
Findings of the Association for Computational Linguistics: ACL 2026
Vision-Language Models (VLMs) have shown strong generalization across multimodal tasks, but their capacity to handle sign language translation (SLT), which requires fine-grained spatiotemporal reasoning and linguistic understanding, remains unclear. In this study, we evaluate whether small VLMs (with ≤3B parameters) can perform SLT effectively. We perform supervised fine-tuning on four publicly available SLT datasets covering three sign languages: one German (DGS), two American (ASL), and one Indian (ISL). We apply parameter-efficient LoRA to the language decoder, train the connector, and keep the vision encoder frozen. To evaluate translation quality, we propose entity- and semantics-aware metrics tailored for SLT. We also highlight the data imbalance issues present in these widely used SLT datasets. Our analysis reveals the limitations of applying general-purpose VLMs to SLT, in contrast to their applicability to other tasks, and provides insights to inform future development of VLMs for sign language processing (SLP), which is essential for building inclusive AI applications.
2025
Enhancing Indian Sign Language Translation via Motion-Aware Modeling
Anal Roy Chowdhury | Debarshi Kumar Sanyal
Proceedings of the Workshop on Sign Language Processing (WSLP)
Sign language translation (SLT) has witnessed rapid progress in the deep learning community across several sign languages, including German, American, British, and Italian. However, Indian Sign Language (ISL) remains relatively underexplored. Motivated by recent efforts to develop large-scale ISL resources, we investigate how existing SLT models perform on ISL data. Specifically, we evaluate three approaches: (i) training a transformer-based model, (ii) leveraging visual-language pretraining, and (iii) tuning a language model with pre-trained visual and motion representations. Unlike existing methods that primarily use raw video frames, we augment the model with optical flow maps to explicitly capture motion primitives, combined with a multi-scale feature extraction method for encoding spatial features (SpaMo-OF). Our approach achieves promising results, obtaining a BLEU-4 score of 8.58 on the iSign test set, establishing a strong baseline for future ISL translation research.