Context-Driven and Reference-Guided Data Augmentation for Subtitle Translation

Hitoshi Ito, Naoto Shirai, Kazutaka Kinugawa, Hideya Mino, Rei Endo, Yoshihiko Kawai


Abstract
Large language models (LLMs) have demonstrated strong performance in translation tasks. Subtitle translation presents unique challenges, such as preserving the original work’s worldview and the distinctive speaking styles of its characters. Achieving high-quality translations that reflect these stylistic nuances typically requires bilingual data for a specific movie, which is often scarce or unavailable. Thus, we propose a data augmentation method that uses LLMs to improve translation performance for specific movies, even when only a few hundred bilingual sentence pairs are available. The method expands source-side data by rewriting original subtitles using information that can be extracted from the context, such as character profiles and scene descriptions, to maintain the tone and thematic consistency of the movie. For translation, the augmented sentences are aligned with manually translated originals using structural similarity, which enables style-preserving bilingual data generation via one-shot learning. Experimental results show that data augmented using the proposed method effectively improves BLEU scores for film subtitle translation, and achieves superior stylistic quality in human evaluation.
Anthology ID:
2026.findings-acl.2059
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
41381–41394
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.2059/
Cite (ACL):
Hitoshi Ito, Naoto Shirai, Kazutaka Kinugawa, Hideya Mino, Rei Endo, and Yoshihiko Kawai. 2026. Context-Driven and Reference-Guided Data Augmentation for Subtitle Translation. In Findings of the Association for Computational Linguistics: ACL 2026, pages 41381–41394, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Context-Driven and Reference-Guided Data Augmentation for Subtitle Translation (Ito et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.2059.pdf
Checklist:
2026.findings-acl.2059.checklist.pdf