Junwen Mo
2026
J-Shuwa: A Large-Scale Web-Collected Japanese Sign Language-Japanese Parallel Corpus
Junwen Mo | MinhDuc Vo | Noriki Nishida | Shin'ichi Satoh | Hideki Nakayama
Findings of the Association for Computational Linguistics: ACL 2026
Japanese Sign Language (JSL) is a low-resource sign language that has received limited attention in the AI research community, primarily due to the lack of large-scale, publicly available parallel corpora. In this work, we introduce J-Shuwa, a large-scale JSL-Japanese parallel corpus constructed from YouTube videos with hard-coded subtitles and closed captions. The corpus contains 197K parallel JSL-Japanese sentence pairs, totaling approximately 300 hours of video, making it the largest publicly available JSL dataset to date. We conduct sign language translation (SLT) experiments by training models on J-Shuwa and evaluating them on the JSL Dialogue Corpus under both zero-shot and fine-tuned settings. Our results demonstrate that J-Shuwa is effective for training SLT models. Beyond SLT, we believe that J-Shuwa can also serve as a valuable resource for future JSL research across a wide range of tasks. The dataset and code are publicly available at: https://github.com/SpaJune/J-Shuwa.
2025
Improving Sign Language Understanding with a Multi-Stream Masked Autoencoder Trained on ASL Videos
Junwen Mo | MinhDuc Vo | Hideki Nakayama
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Sign language understanding remains a significant challenge, particularly for low-resource sign languages with limited annotated data. Motivated by the success of large-scale pretraining in deep learning, we propose Multi-Stream Masked Autoencoder (MS-MAE) — a simple yet effective framework for learning sign language representations from skeleton-based video data. We pretrained a model with MS-MAE on the YouTube-ASL dataset, and then adapted it to multiple downstream tasks across different sign languages. Experimental results show that MS-MAE achieves competitive or superior performance on a range of isolated sign language recognition benchmarks and gloss-free sign language translation tasks across several sign languages. These findings highlight the potential of leveraging large-scale, high-resource sign language data to boost performance in low-resource sign language scenarios. Additionally, visualization of the model’s attention maps reveals its ability to cluster adjacent pose sequences within a sentence, some of which align with individual signs, offering insights into the mechanisms underlying successful transfer learning.