Document-aligned Japanese-English Conversation Parallel Corpus

Matīss Rikters; Ryokan Ri; Tong Li; Toshiaki Nakazawa

Abstract

Sentence-level (SL) machine translation (MT) has reached acceptable quality for many high-resourced languages, but document-level (DL) MT has not: it is difficult to 1) train with the small amount of DL data available; and 2) evaluate, as the main methods and data sets focus on SL evaluation. To address the first issue, we present a document-aligned Japanese-English conversation corpus, including balanced, high-quality business conversation data for tuning and testing. As for the second issue, we manually identify the main areas where SL MT fails to produce adequate translations for lack of context. We then create an evaluation set in which these phenomena are annotated to facilitate automatic evaluation of DL systems. We train MT models using our corpus to demonstrate how using context leads to improvements.

Anthology ID: 2020.wmt-1.74
Volume: Proceedings of the Fifth Conference on Machine Translation
Month: November
Year: 2020
Address: Online
Venues: EMNLP | WMT
SIG: SIGMT
Publisher: Association for Computational Linguistics
Pages: 639–645
URL: https://aclanthology.org/2020.wmt-1.74
Cite (ACL): Matīss Rikters, Ryokan Ri, Tong Li, and Toshiaki Nakazawa. 2020. Document-aligned Japanese-English Conversation Parallel Corpus. In Proceedings of the Fifth Conference on Machine Translation, pages 639–645, Online. Association for Computational Linguistics.
Cite (Informal): Document-aligned Japanese-English Conversation Parallel Corpus (Rikters et al., WMT 2020)
PDF: https://preview.aclanthology.org/update-css-js/2020.wmt-1.74.pdf
Optional supplementary material: 2020.wmt-1.74.OptionalSupplementaryMaterial.pdf
Video: https://slideslive.com/38939560
Code: tsuruoka-lab/AMI-Meeting-Parallel-Corpus
Data: Business Scene Dialogue, JParaCrawl
