Workshop on Asian Translation (2022)


Proceedings of the 9th Workshop on Asian Translation

Overview of the 9th Workshop on Asian Translation
Toshiaki Nakazawa | Hideya Mino | Isao Goto | Raj Dabre | Shohei Higashiyama | Shantipriya Parida | Anoop Kunchukuttan | Makoto Morishita | Ondřej Bojar | Chenhui Chu | Akiko Eriguchi | Kaori Abe | Yusuke Oda | Sadao Kurohashi

This paper presents the results of the shared tasks from the 9th Workshop on Asian Translation (WAT2022). For WAT2022, 8 teams submitted translation results for human evaluation. We also accepted 4 research papers. About 300 translation results were submitted to the automatic evaluation server, and selected submissions were manually evaluated.

Comparing BERT-based Reward Functions for Deep Reinforcement Learning in Machine Translation
Yuki Nakatani | Tomoyuki Kajiwara | Takashi Ninomiya

In text generation tasks such as machine translation, models are generally trained using cross-entropy loss. However, the mismatch between the loss function and the evaluation metric is often problematic. It is known that this problem can be addressed by directly optimizing the evaluation metric with reinforcement learning. In machine translation, previous studies have used BLEU to calculate rewards for reinforcement learning, but BLEU is not well correlated with human evaluation. In this study, we investigate the impact on machine translation quality of reinforcement learning based on evaluation metrics that correlate more highly with human evaluation. Experimental results show that reinforcement learning with BERT-based rewards can improve various evaluation metrics.
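
As a rough illustration of such a reward, the sketch below scores sampled translations with BERTScore F1 (via the public bert-score package) and plugs the scores into a REINFORCE-style loss. This is a minimal sketch under our own assumptions; the paper's exact reward formulation, baseline, and BERT variant may differ.

```python
# Illustrative sketch only: reward sampled translations with BERTScore F1
# and use the reward in a REINFORCE-style loss. The baseline and the
# underlying BERT model are assumptions, not the paper's exact setup.
from bert_score import score  # pip install bert-score


def bert_reward(hypotheses, references, lang="en"):
    """BERTScore F1 of each sampled hypothesis against its reference."""
    _, _, f1 = score(hypotheses, references, lang=lang, verbose=False)
    return f1  # tensor of shape (batch,)


def reinforce_loss(seq_log_probs, hypotheses, references, baseline=0.0):
    """Scale each sequence's summed log-probability by (reward - baseline)."""
    advantage = bert_reward(hypotheses, references) - baseline
    return -(advantage.to(seq_log_probs.device) * seq_log_probs).mean()
```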

Improving Jejueo-Korean Translation With Cross-Lingual Pretraining Using Japanese and Korean
Francis Zheng | Edison Marrese-Taylor | Yutaka Matsuo

Jejueo is a critically endangered language spoken on Jeju Island and is closely related to but mutually unintelligible with Korean. Parallel data between Jejueo and Korean is scarce, and translation between the two languages requires more attention, as current neural machine translation systems typically rely on large amounts of parallel training data. While low-resource machine translation has been shown to benefit from using additional monolingual data during the pretraining process, not as much research has been done on how to select languages other than the source and target languages for use during pretraining. We show that using large amounts of Korean and Japanese data during the pretraining process improves translation by 2.16 BLEU points for translation in the Jejueo → Korean direction and 1.34 BLEU points for translation in the Korean → Jejueo direction compared to the baseline.

TMU NMT System with Automatic Post-Editing by Multi-Source Levenshtein Transformer for the Restricted Translation Task of WAT 2022
Seiichiro Kondo | Mamoru Komachi

In this paper, we describe our TMU English–Japanese systems submitted to the restricted translation task at WAT 2022 (Nakazawa et al., 2022). In this task, we translate an input sentence with the constraint that certain words or phrases (called restricted target vocabularies (RTVs)) should be contained in the output sentence. To satisfy this constraint, we address this task using a combination of two techniques. One is lexical-constraint-aware neural machine translation (LeCA) (Chen et al., 2020), which is a method of adding RTVs at the end of input sentences. The other is multi-source Levenshtein transformer (MSLevT) (Wan et al., 2020), which is a non-autoregressive method for automatic post-editing. Our system generates translations in two steps. First, we generate the translation using LeCA. Subsequently, we filter the sentences that do not satisfy the constraints and post-edit them with MSLevT. Our experimental results reveal that 100% of the RTVs can be included in the generated sentences while maintaining the translation quality of the LeCA model on both English to Japanese (En→Ja) and Japanese to English (Ja→En) tasks. Furthermore, the method used in previous studies requires an increase in the beam size to satisfy the constraints, which is computationally expensive. In contrast, the proposed method does not require a similar increase and can generate translations faster.
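
The two-step decoding flow can be summarized with the following sketch: translate with the lexically-constrained LeCA model, then route only the outputs that miss an RTV to the MSLevT post-editor. The model objects, their methods, and the separator token are hypothetical placeholders, not the authors' actual interfaces.

```python
# Illustrative two-step pipeline: LeCA-style constrained decoding, then
# MSLevT post-editing only for outputs that violate a constraint.
# `leca_model`, `mslevt_model`, and "<sep>" are hypothetical placeholders.


def satisfies_rtvs(translation: str, rtvs: list[str]) -> bool:
    """True if every restricted target vocabulary (RTV) item appears."""
    return all(rtv in translation for rtv in rtvs)


def translate_with_rtvs(source: str, rtvs: list[str],
                        leca_model, mslevt_model) -> str:
    # Step 1: LeCA appends the RTVs to the end of the input sentence.
    hypothesis = leca_model.translate(source + " <sep> " + " ".join(rtvs))
    # Step 2: post-edit only the hypotheses that miss a constraint.
    if not satisfies_rtvs(hypothesis, rtvs):
        hypothesis = mslevt_model.post_edit(source, hypothesis, rtvs)
    return hypothesis
```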

HwTscSU’s Submissions on WAT 2022 Shared Task
Yilun Liu | Zhen Zhang | Shimin Tao | Junhui Li | Hao Yang

In this paper we describe our submission to the shared tasks of the 9th Workshop on Asian Translation (WAT 2022) on NICT–SAP under the team name "HwTscSU". The tasks involve translation from 5 languages into English and vice versa in two domains: the IT domain and the Wikinews domain. The purpose is to determine the feasibility of multilingualism, domain adaptation, or document-level knowledge given little to no clean parallel corpora for training. Our approach for all translation tasks mainly focused on pre-training NMT models on general datasets and fine-tuning them on domain-specific datasets. Due to the small amount of parallel corpora, we collected and cleaned data from OPUS, including three IT-domain corpora: GNOME, KDE4, and Ubuntu. We then trained Transformer models on the collected dataset and fine-tuned them on the corresponding dev sets. The BLEU scores improved greatly in comparison with other systems. Our submission ranked 1st in all IT-domain tasks and in one out of eight ALT-domain tasks.

NICT’s Submission to the WAT 2022 Structured Document Translation Task
Raj Dabre

We present our submission to the structured document translation task organized by WAT 2022. In structured document translation, the key challenge is the handling of inline tags that annotate the text: a tagged span in the source should be translated such that the corresponding span in the output carries the same tags. This challenge is further compounded by the lack of training data containing sentence pairs with inline XML-tag-annotated content. However, to our surprise, we find that existing multilingual NMT systems are able to translate text annotated with XML tags without any explicit training on data containing such tags. Specifically, massively multilingual translation models like M2M-100 perform well despite not being explicitly trained to handle structured content. This direct translation approach is often as good as, if not better than, the traditional "remove tags, translate, and re-inject tags" method, also known as the "detag-and-project" approach.
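
As an illustration of the direct approach, the following sketch passes a tag-annotated sentence straight through the public M2M-100 checkpoint on Hugging Face and lets the model copy the inline tags into the output. The example sentence and decoding settings are our own assumptions, not the paper's exact setup.

```python
# Minimal sketch of "direct translation" of tag-annotated text with the
# public facebook/m2m100_418M checkpoint; no special tag handling.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

src = "Click the <b>Save</b> button to store your changes."
tokenizer.src_lang = "en"
encoded = tokenizer(src, return_tensors="pt")
generated = model.generate(
    **encoded, forced_bos_token_id=tokenizer.get_lang_id("ja")
)
# Ideally the Japanese output still wraps the translation of "Save" in <b>...</b>.
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```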

Rakuten’s Participation in WAT 2022: Parallel Dataset Filtering by Leveraging Vocabulary Heterogeneity
Alberto Poncelas | Johanes Effendi | Ohnmar Htun | Sunil Yadav | Dongzhe Wang | Saurabh Jain

This paper introduces our neural machine translation system's participation in the WAT 2022 shared translation task (team ID: sakura). We participated in the Parallel Data Filtering Task. Our approach, based on Feature Decay Algorithms (FDA), achieved +1.4 and +2.4 BLEU points for English to Japanese and Japanese to English, respectively, compared to the model trained on the full dataset, showing the effectiveness of FDA for in-domain data selection.
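
To make the selection principle concrete, here is a minimal sketch of FDA-style data selection: each candidate sentence is scored by its n-gram overlap with an in-domain seed set, and a feature's value decays each time a sentence containing it is selected, which promotes variety. This is a generic textbook formulation; Rakuten's exact variant and decay schedule may differ.

```python
# Naive Feature Decay Algorithm (FDA) sketch for parallel data filtering.
from collections import Counter


def ngrams(tokens, n_max=3):
    """All 1..n_max-grams of a token list."""
    return [tuple(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def fda_select(candidates, seed_sentences, k, decay=0.5):
    """Greedily pick k candidates by decaying n-gram overlap with the seed."""
    seed_feats = set()
    for sent in seed_sentences:
        seed_feats.update(ngrams(sent.split()))
    counts = Counter()  # times each seed feature has been selected so far
    selected, pool = [], list(candidates)
    for _ in range(min(k, len(pool))):
        def score(sent):
            feats = [f for f in ngrams(sent.split()) if f in seed_feats]
            # A feature's value halves every time it has already been picked.
            return sum(decay ** counts[f] for f in feats) / max(len(sent.split()), 1)
        best = max(pool, key=score)
        pool.remove(best)
        selected.append(best)
        counts.update(f for f in ngrams(best.split()) if f in seed_feats)
    return selected
```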

NIT Rourkela Machine Translation (MT) System Submission to WAT 2022 for MultiIndicMT: An Indic Language Multilingual Shared Task
Sudhansu Bala Das | Atharv Biradar | Tapas Kumar Mishra | Bidyut Kumar Patra

Multilingual Neural Machine Translation (MNMT) exhibits incredible performance with the development of a single translation model for many languages. Previous studies on multilingual translation reveal that multilingual training is effective for languages with limited corpora. This paper presents our submission (Team Id: NITR) to the WAT 2022 "MultiIndicMT shared task", where the objective is translation between English and 5 Indic languages from the OPUS corpus (newly added to the WAT 2022 corpus), using the corpus provided by the organizers of WAT. Our system is based on transformer-based NMT using the fairseq modelling toolkit with ensemble techniques. Heuristic pre-processing approaches are carried out before training the model. Our multilingual NMT systems are trained with shared encoder and decoder parameters, with language embeddings assigned to each token in both the encoder and decoder. Our final multilingual system was evaluated using BLEU and RIBES scores. In future work, we plan to fine-tune both the encoder and decoder during monolingual unsupervised training in order to improve the quality of the synthetic data generated in the process.

Investigation of Multilingual Neural Machine Translation for Indian Languages
Sahinur Rahman Laskar | Riyanka Manna | Partha Pakray | Sivaji Bandyopadhyay

In the domain of natural language processing, machine translation is a well-defined task in which one natural language is automatically translated into another. The deep learning-based approach to machine translation, known as neural machine translation (NMT), attains remarkable translation performance. However, it requires a sufficient amount of training data, which is a critical issue for low-resource pair translation. To handle the data scarcity problem, the multilingual concept has been investigated in neural machine translation in different settings, such as many-to-one and one-to-many translation. WAT2022 (Workshop on Asian Translation 2022), hosted by COLING 2022, organizes English-to-Indic and Indic-to-English translation tasks, in which we participated as team CNLP-NITS-PP. Herein, we investigate a transliteration-based approach, where Indic languages are transliterated into English script and a sub-word level vocabulary is shared during training. We attained BLEU scores of 2.0 (English-to-Bengali), 1.10 (English-to-Assamese), 4.50 (Bengali-to-English), and 3.50 (Assamese-to-English).
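
As a concrete illustration of the transliteration step, the sketch below romanizes Bengali-script text with the indic-transliteration package before subword vocabulary learning; Assamese is also written in the Bengali-Assamese script, so the same source scheme covers both languages. The choice of the ITRANS scheme is our assumption; the abstract does not specify the authors' exact scheme.

```python
# Romanize Indic-script text into English (Latin) script so that a subword
# vocabulary can be shared with English. Uses the indic-transliteration
# package; the ITRANS target scheme is an assumption, not the paper's choice.
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

bengali_text = "আমি বাংলায় গান গাই"
romanized = transliterate(bengali_text, sanscript.BENGALI, sanscript.ITRANS)
print(romanized)  # Latin-script output now shares characters with English
```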

Can Partial Pretranslation Improve Low-Resourced Language Pairs?
Raoul Blin

We study the effects of a local and punctual pretranslation of the source corpus on the performance of a Transformer translation model. The pretranslations are performed at the morphological (morpheme translation), lexical (word translation), and morphosyntactic (numeral groups and dates) levels. We focus on small and medium-sized training corpora (50K–2.5M bisegments) and on a linguistically distant language pair (Japanese and French). We find that this type of pretranslation does not lead to significant progress. We describe the motivation for the approach and the specific difficulties of Japanese-French translation, and we discuss possible reasons for the observed underperformance.

Multimodal Neural Machine Translation with Search Engine Based Image Retrieval
ZhenHao Tang | XiaoBing Zhang | Zi Long | XiangHua Fu

Recently, a number of works have shown that the performance of neural machine translation (NMT) can be improved to a certain extent by using visual information. However, most of these conclusions are drawn from the analysis of experimental results based on a limited set of bilingual sentence-image pairs, such as Multi30K. In these kinds of datasets, the content of each bilingual parallel sentence pair must be well represented by a manually annotated image, which differs from the actual translation situation. We propose an open-vocabulary image retrieval method that collects descriptive images for a bilingual parallel corpus using an image search engine, and a text-aware attentive visual encoder to filter out incorrectly collected noisy images. Experimental results on Multi30K and two other translation datasets show that our proposed method achieves significant improvements over strong baselines.
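
To illustrate what a text-aware attentive visual encoder might look like, here is a small PyTorch sketch in which the sentence representation attends over the retrieved image features, so that images unrelated to the text receive low weight. The layer sizes and attention form are our assumptions; the paper's actual architecture is not specified in the abstract.

```python
# Hypothetical sketch of a text-aware attentive visual encoder: the text
# representation queries the retrieved image features, downweighting noise.
import torch
import torch.nn as nn


class TextAwareVisualEncoder(nn.Module):
    def __init__(self, text_dim=512, img_dim=2048, hidden=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)
        self.txt_proj = nn.Linear(text_dim, hidden)

    def forward(self, text_repr, image_feats):
        # text_repr: (batch, text_dim); image_feats: (batch, n_imgs, img_dim)
        q = self.txt_proj(text_repr).unsqueeze(1)   # (batch, 1, hidden)
        k = self.img_proj(image_feats)              # (batch, n_imgs, hidden)
        attn = torch.softmax((q * k).sum(-1) / k.size(-1) ** 0.5, dim=-1)
        # Weighted sum downplays images that do not match the text.
        return (attn.unsqueeze(-1) * k).sum(1)      # (batch, hidden)
```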

Silo NLP’s Participation at WAT2022
Shantipriya Parida | Subhadarshi Panda | Stig-Arne Grönroos | Mark Granroth-Wilding | Mika Koistinen

This paper provides the system description of "Silo NLP's" submission to the Workshop on Asian Translation (WAT2022). We participated in the Indic multimodal tasks (English->Hindi, English->Malayalam, and English->Bengali multimodal translation). For text-only translation, we used the Transformer and fine-tuned mBART. For multimodal translation, we used the same architecture and extracted object tags from the images to use as visual features, concatenated with the text sequence as input. Our submission tops many tasks, including English->Hindi multimodal translation (evaluation test), English->Malayalam text-only and multimodal translation (evaluation test), English->Bengali multimodal translation (challenge test), and English->Bengali text-only translation (evaluation test).
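
A minimal sketch of the multimodal input construction described above: object tags detected in the image are appended to the source sentence before tokenization. The detector interface and the separator token are hypothetical placeholders, not the system's actual components.

```python
# Hypothetical sketch: append detected object tags to the source sentence so
# they can be consumed as extra tokens by the text encoder. The detector
# callable and "<sep>" separator are placeholders.


def build_multimodal_input(source: str, image_path: str,
                           detect_object_tags) -> str:
    tags = detect_object_tags(image_path)  # e.g. ["man", "bicycle", "street"]
    return source + " <sep> " + " ".join(sorted(set(tags)))


# Example result: "A man rides a bicycle. <sep> bicycle man street"
```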

PICT@WAT 2022: Neural Machine Translation Systems for Indic Languages
Anupam Patil | Isha Joshi | Dipali Kadam

Translation entails more than simply converting words from one language to another. It is vitally important for effective cross-cultural communication, making good translation systems an important requirement. In this paper, we describe our systems submitted to the WAT 2022 translation shared tasks. As part of the multimodal translation tasks' text-only sub-tasks, we submitted three Neural Machine Translation systems based on Transformer models for English to Malayalam, English to Bengali, and English to Hindi text translation. Our English-Indic (en-xx) systems achieved significant results on the leaderboard, using BLEU and RIBES scores as comparative metrics. For the respective translations of English to Malayalam, Bengali, and Hindi, we obtained BLEU scores of 19.50, 32.90, and 41.80 on the challenge subset and 30.60, 39.80, and 42.90 on the benchmark evaluation subset.

English to Bengali Multimodal Neural Machine Translation using Transliteration-based Phrase Pairs Augmentation
Sahinur Rahman Laskar | Pankaj Dadure | Riyanka Manna | Partha Pakray | Sivaji Bandyopadhyay

Automatic translation of one natural language into another is a popular task in natural language processing. Although the deep learning-based technique known as neural machine translation (NMT) is a widely accepted machine translation approach, it needs an adequate amount of training data, which is a challenging issue for low-resource pair translation. Moreover, the multimodal concept utilizes text and visual features to improve low-resource pair translation. WAT2022 (Workshop on Asian Translation 2022), hosted by COLING 2022, organizes an English to Bengali multimodal translation task in which we participated as team CNLP-NITS-PP in two tracks: 1) text-only and 2) multimodal translation. Herein, we propose a transliteration-based phrase pair augmentation approach, which shows improvement in the multimodal translation task and achieves benchmark results on the Bengali Visual Genome 1.0 dataset. We attained the best results on the challenge and evaluation test sets for English to Bengali multimodal translation, with BLEU scores of 28.70 and 43.90 and RIBES scores of 0.688931 and 0.780669, respectively.

Investigation of English to Hindi Multimodal Neural Machine Translation using Transliteration-based Phrase Pairs Augmentation
Sahinur Rahman Laskar | Rahul Singh | Md Faizal Karim | Riyanka Manna | Partha Pakray | Sivaji Bandyopadhyay

Machine translation, a well-defined natural language processing task, translates one natural language into another. Neural machine translation (NMT) is a widely accepted machine translation approach, but it requires a sufficient amount of training data, which is a challenging issue for low-resource pair translation. Moreover, the multimodal concept utilizes text and visual features to improve low-resource pair translation. WAT2022 (Workshop on Asian Translation 2022), hosted by COLING 2022, organizes an English to Hindi multimodal translation task in which we participated as team CNLP-NITS-PP in two tracks: 1) text-only and 2) multimodal translation. Herein, we propose a transliteration-based phrase pair augmentation approach, which shows improvement in the multimodal translation task. We attained the second-best results on the challenge test set for English to Hindi multimodal translation, with a BLEU score of 39.30 and a RIBES score of 0.791468.