Augmenting Sign Language Translation Datasets with Large Language Models
Pedro Alejandro Dal Bianco, Jean Paul Nunes Reinhold, Facundo Manuel Quiroga, Franco Ronchetti
Abstract
Sign language translation (SLT) is a challenging task due to the scarcity of labeled data and the heavy-tailed distribution of sign language vocabularies. In this paper, we explore a novel data augmentation approach for SLT: using a large language model (LLM) to generate paraphrases of the target language sentences in the training data. We experiment with a Transformer-based SLT model (Signformer) on three datasets spanning German, Greek, and Argentinian Sign Languages. For models trained with augmentation, we adopt a two-stage regime: pre-train on the LLM-augmented corpus and then fine-tune on the original, non-augmented training set. Our augmented training sets, expanded with GPT-4-generated paraphrases, yield mixed results. On a medium-scale German SL corpus (PHOENIX14T), LLM augmentation improves BLEU-4 from 9.56 to 10.33. In contrast, a small-vocabulary Greek SL dataset with a near-perfect baseline (94.38 BLEU) sees a slight drop to 92.22 BLEU, and a complex Argentinian SL corpus with a long-tail vocabulary distribution remains around 1.2 BLEU despite augmentation. We analyze these outcomes in relation to each dataset’s complexity and token frequency distribution, finding that LLM-based augmentation is more beneficial when the dataset contains a richer vocabulary and many infrequent tokens. To our knowledge, this work is the first to apply LLM paraphrasing to SLT, and we discuss these results with respect to prior data augmentation efforts in sign language translation.
- Anthology ID:
- 2025.wslp-main.4
- Volume:
- Proceedings of the Workshop on Sign Language Processing (WSLP)
- Month:
- December
- Year:
- 2025
- Address:
- IIT Bombay, Mumbai, India (Co-located with IJCNLP–AACL 2025)
- Editors:
- Mohammed Hasanuzzaman, Facundo Manuel Quiroga, Ashutosh Modi, Sabyasachi Kamila, Keren Artiaga, Abhinav Joshi, Sanjeet Singh
- Venues:
- WSLP | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 20–26
- URL:
- https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.wslp-main.4/
- Cite (ACL):
- Pedro Alejandro Dal Bianco, Jean Paul Nunes Reinhold, Facundo Manuel Quiroga, and Franco Ronchetti. 2025. Augmenting Sign Language Translation Datasets with Large Language Models. In Proceedings of the Workshop on Sign Language Processing (WSLP), pages 20–26, IIT Bombay, Mumbai, India (Co-located with IJCNLP–AACL 2025). Association for Computational Linguistics.
- Cite (Informal):
- Augmenting Sign Language Translation Datasets with Large Language Models (Bianco et al., WSLP 2025)
- PDF:
- https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.wslp-main.4.pdf