Abstract
Machine translation (MT) is an important task in natural language processing that aims to translate a sentence in a source language into a sentence with the same or similar meaning in a target language. Despite the considerable effort spent on building MT systems for different language pairs, most previous work focuses on formal-language settings, where the text to be translated comes from written sources such as books and news articles. As a result, such MT systems can fail to translate livestreaming video transcripts, where the text is often shorter and may be grammatically incorrect. To overcome this issue, we introduce a novel MT corpus, BehanceMT, for livestreaming video transcript translation. Our corpus contains parallel transcripts for 3 language pairs, where English is the source language and Spanish, Chinese, and Arabic are the target languages. Experimental results show that finetuning a pretrained MT model on BehanceMT significantly improves the performance of the model in translating video transcripts across all 3 language pairs. In addition, the finetuned MT model outperforms GoogleTranslate in 2 out of 3 language pairs, further demonstrating the usefulness of our proposed dataset for video transcript translation. BehanceMT will be publicly released upon the acceptance of the paper.
- Anthology ID:
- 2022.tu-1.4
- Volume:
- Proceedings of the First Workshop On Transcript Understanding
- Month:
- Oct
- Year:
- 2022
- Address:
- Gyeongju, South Korea
- Venue:
- TU
- Publisher:
- International Conference on Computational Linguistics
- Pages:
- 30–33
- URL:
- https://aclanthology.org/2022.tu-1.4
- Cite (ACL):
- Minh Van Nguyen, Franck Dernoncourt, and Thien Nguyen. 2022. BehanceMT: A Machine Translation Corpus for Livestreaming Video Transcripts. In Proceedings of the First Workshop On Transcript Understanding, pages 30–33, Gyeongju, South Korea. International Conference on Computational Linguistics.
- Cite (Informal):
- BehanceMT: A Machine Translation Corpus for Livestreaming Video Transcripts (Nguyen et al., TU 2022)
- PDF:
- https://preview.aclanthology.org/remove-xml-comments/2022.tu-1.4.pdf