English-to-Japanese Multimodal Machine Translation Based on Image-Text Matching of Lecture Videos
Ayu Teramen, Takumi Ohtsuka, Risa Kondo, Tomoyuki Kajiwara, Takashi Ninomiya
Abstract
We work on a multimodal machine translation of the audio contained in English lecture videos to generate Japanese subtitles. Image-guided multimodal machine translation is promising for error correction in speech recognition and for text disambiguation. In our situation, lecture videos provide a variety of images. Images of presentation materials can complement information not available from audio and may help improve translation quality. However, images of speakers or audiences would not directly affect the translation quality. We construct a multimodal parallel corpus with automatic speech recognition text and multiple images for a transcribed parallel corpus of lecture videos, and propose a method to select the most relevant ones from the multiple images with the speech text for improving the performance of image-guided multimodal machine translation. Experimental results on translating automatic speech recognition or transcribed English text into Japanese show the effectiveness of our method to select a relevant image.
- Anthology ID:
- 2024.alvr-1.7
- Volume:
- Proceedings of the 3rd Workshop on Advances in Language and Vision Research (ALVR)
- Month:
- August
- Year:
- 2024
- Address:
- Bangkok, Thailand
- Editors:
- Jing Gu, Tsu-Jui (Ray) Fu, Drew Hudson, Asli Celikyilmaz, William Wang
- Venues:
- ALVR | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 86–91
- URL:
- https://preview.aclanthology.org/add-emnlp-2024-awards/2024.alvr-1.7/
- DOI:
- 10.18653/v1/2024.alvr-1.7
- Cite (ACL):
- Ayu Teramen, Takumi Ohtsuka, Risa Kondo, Tomoyuki Kajiwara, and Takashi Ninomiya. 2024. English-to-Japanese Multimodal Machine Translation Based on Image-Text Matching of Lecture Videos. In Proceedings of the 3rd Workshop on Advances in Language and Vision Research (ALVR), pages 86–91, Bangkok, Thailand. Association for Computational Linguistics.
- Cite (Informal):
- English-to-Japanese Multimodal Machine Translation Based on Image-Text Matching of Lecture Videos (Teramen et al., ALVR 2024)
- PDF:
- https://preview.aclanthology.org/add-emnlp-2024-awards/2024.alvr-1.7.pdf