Hideki Tanaka


2022

A Multilingual Multiway Evaluation Data Set for Structured Document Translation of Asian Languages
Bianka Buschbeck | Raj Dabre | Miriam Exel | Matthias Huck | Patrick Huy | Raphael Rubino | Hideki Tanaka
Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022

Translation of structured content is an important application of machine translation, but the scarcity of evaluation data sets, especially for Asian languages, limits progress. In this paper we present a novel multilingual multiway evaluation data set for the translation of structured documents of the Asian languages Japanese, Korean and Chinese. We describe the data set, its creation process and important characteristics, followed by establishing and evaluating baselines using the direct translation as well as detag-project approaches. Our data set is well suited for multilingual evaluation, and it contains richer annotation tag sets than existing data sets. Our results show that massively multilingual translation models like M2M-100 and mBART-50 perform surprisingly well despite not being explicitly trained to handle structured content. The data set described in this paper and used in our experiments is released publicly.

FeatureBART: Feature Based Sequence-to-Sequence Pre-Training for Low-Resource NMT
Abhisek Chakrabarty | Raj Dabre | Chenchen Ding | Hideki Tanaka | Masao Utiyama | Eiichiro Sumita
Proceedings of the 29th International Conference on Computational Linguistics

In this paper we present FeatureBART, a linguistically motivated sequence-to-sequence monolingual pre-training strategy in which syntactic features such as lemma, part-of-speech and dependency labels are incorporated into the span prediction based pre-training framework (BART). These automatically extracted features are incorporated via approaches such as concatenation and relevance mechanisms, among which the latter is known to be better than the former. When used for low-resource NMT as a downstream task, we show that these feature based models give large improvements in bilingual settings and modest ones in multilingual settings over their counterparts that do not use features.

2021

Field Experiments of Real Time Foreign News Distribution Powered by MT
Keiji Yasuda | Ichiro Yamada | Naoaki Okazaki | Hideki Tanaka | Hidehiro Asaka | Takeshi Anzai | Fumiaki Sugaya
Proceedings of Machine Translation Summit XVIII: Users and Providers Track

Field experiments on a foreign news distribution system using two key technologies are reported. The first technology is a summarization component, which is used for generating news headlines. This component is a transformer-based abstractive text summarization system which is trained to output headlines from the leading sentences of news articles. The second technology is machine translation (MT), which enables users to read foreign news articles in their native language. Since the system uses MT, users can immediately access the latest foreign news. 139 Japanese LINE users participated in the field experiments for two weeks, viewing about 40,000 articles which had been translated from English to Japanese. We carried out surveys both during and after the experiments. According to the results, 79.3% of users evaluated the headlines as adequate, while 74.7% of users evaluated the automatically translated articles as intelligible. According to the post-experiment survey, 59.7% of users wished to continue using the system; 11.5% of users did not. We also report several statistics of the experiments.

2020

Neural Machine Translation Using Extracted Context Based on Deep Analysis for the Japanese-English Newswire Task at WAT 2020
Isao Goto | Hideya Mino | Hitoshi Ito | Kazutaka Kinugawa | Ichiro Yamada | Hideki Tanaka
Proceedings of the 7th Workshop on Asian Translation

This paper describes the system of the NHK-NES team for the WAT 2020 Japanese–English newswire task. There are two main problems in Japanese-English news translation: translation of dropped subjects and compatibility between equivalent translations and English news-style outputs. We address these problems by extracting subjects from the context based on predicate-argument structures and using them as additional inputs, and constructing parallel Japanese-English news sentences equivalently translated from English news sentences. The evaluation results confirm the effectiveness of our context-utilization method.

Content-Equivalent Translated Parallel News Corpus and Extension of Domain Adaptation for NMT
Hideya Mino | Hideki Tanaka | Hitoshi Ito | Isao Goto | Ichiro Yamada | Takenobu Tokunaga
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this paper, we deal with two problems in Japanese-English machine translation of news articles. The first problem is the quality of parallel corpora. Neural machine translation (NMT) systems suffer degraded performance when trained with noisy data. Because there is no clean Japanese-English parallel data for news articles, we build a novel parallel news corpus consisting of Japanese news articles translated into English in a content-equivalent manner. This is the first content-equivalent Japanese-English news corpus translated specifically for training NMT systems. The second problem involves the domain-adaptation technique. NMT systems suffer degraded performance when trained with mixed data having different features, such as noisy data and clean data. Although existing methods try to overcome this problem by using tags to distinguish between corpora, this alone is not sufficient. We thus extend a domain-adaptation method using multi-tags to train an NMT model effectively with the clean corpus and existing parallel news corpora with some types of noise. Experimental results show that our corpus increases the translation quality, and that our domain-adaptation method is more effective for learning with multiple types of corpora than existing domain-adaptation methods are.

2019

Neural Machine Translation System using a Content-equivalently Translated Parallel Corpus for the Newswire Translation Tasks at WAT 2019
Hideya Mino | Hitoshi Ito | Isao Goto | Ichiro Yamada | Hideki Tanaka | Takenobu Tokunaga
Proceedings of the 6th Workshop on Asian Translation

This paper describes NHK and NHK Engineering System (NHK-ES)’s submission to the newswire translation tasks of WAT 2019 in both directions, Japanese→English and English→Japanese. In addition to the JIJI Corpus that was officially provided by the task organizer, we developed a corpus of 0.22M sentence pairs by manually translating Japanese news sentences into English in a content-equivalent manner. The content-equivalent corpus was effective for improving translation quality, and our systems achieved the best human evaluation scores in the newswire translation tasks at WAT 2019.

2017

Detecting Untranslated Content for Neural Machine Translation
Isao Goto | Hideki Tanaka
Proceedings of the First Workshop on Neural Machine Translation

Despite its promise, neural machine translation (NMT) has a serious problem in that source content may be mistakenly left untranslated. The ability to detect untranslated content is important for the practical use of NMT. We evaluate two types of probability with which to detect untranslated content: the cumulative attention (ATN) probability and back translation (BT) probability from the target sentence to the source sentence. Experiments on detecting untranslated content in Japanese-English patent translations show that ATN and BT are each more effective than random choice, BT is more effective than ATN, and the combination of the two provides further improvements. We also confirmed the effectiveness of using ATN and BT to rerank the n-best NMT outputs.

2015

Japanese news simplification: task design, data set construction, and analysis of simplified text
Isao Goto | Hideki Tanaka | Tadashi Kumano
Proceedings of Machine Translation Summit XV: Papers

The “News Web Easy” news service as a resource for teaching and learning Japanese: An assessment of the comprehension difficulty of Japanese sentence-end expressions
Hideki Tanaka | Tadashi Kumano | Isao Goto
Proceedings of the 2nd Workshop on Natural Language Processing Techniques for Educational Applications

2012

Measuring the Similarity between TV Programs using Semantic Relations
Ichiro Yamada | Masaru Miyazaki | Hideki Sumiyoshi | Atsushi Matsui | Hironori Furumiya | Hideki Tanaka
Proceedings of COLING 2012

2009

Syntax-Driven Sentence Revision for Broadcast News Summarization
Hideki Tanaka | Akinori Kinoshita | Takeshi Kobayakawa | Tadashi Kumano | Naoto Katoh
Proceedings of the 2009 Workshop on Language Generation and Summarisation (UCNLG+Sum 2009)

2007

Extracting phrasal alignments from comparable corpora by using joint probability SMT model
Tadashi Kumano | Hideki Tanaka | Takenobu Tokunaga
Proceedings of the 11th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages: Papers

2005

Analysis and Modeling of Manual Summarization of Japanese Broadcast News
Hideki Tanaka | Tadashi Kumano | Masamichi Nishiwaki | Takayuki Itoh
Companion Volume to the Proceedings of Conference including Posters/Demos and tutorial abstracts

2004

Back Transliteration from Japanese to English using Target English Context
Isao Goto | Naoto Kato | Terumasa Ehara | Hideki Tanaka
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

2003

Word Selection for EBMT based on Monolingual Similarity and Translation Confidence
Eiji Aramaki | Sadao Kurohashi | Hideki Kashioka | Hideki Tanaka
Proceedings of the HLT-NAACL 2003 Workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond

Comparing the Sentence Alignment Yield from Two News Corpora Using a Dictionary-Based Alignment System
Stephen Nightingale | Hideki Tanaka
Proceedings of the HLT-NAACL 2003 Workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond

Construction and Analysis of Japanese-English Broadcast News Corpus with Named Entity Tags
Tadashi Kumano | Hideki Kashioka | Hideki Tanaka | Takahiro Fukusima
Proceedings of the ACL 2003 Workshop on Multilingual and Mixed-language Named Entity Recognition

Building a parallel corpus for monologues with clause alignment
Hideki Kashioka | Takehiko Maruyama | Hideki Tanaka
Proceedings of Machine Translation Summit IX: Papers

Many studies have been reported in the domain of speech-to-speech machine translation systems for travel conversation use. Therefore, a large number of travel domain corpora have become available in recent years. From a wider viewpoint, speech-to-speech systems are required for many purposes other than travel conversation. One of these is monologues (e.g., TV news, lectures, technical presentations). However, in monologues, sentences tend to be long and complicated, which often causes problems for parsing and translation. Therefore, we need a suitable translation unit, rather than the sentence. We propose the clause as a unit for translation. To develop a speech-to-speech machine translation system for monologues based on the clause as the translation unit, we need a monologue parallel corpus with clause alignment. In this paper, we describe how to build a Japanese-English monologue parallel corpus with clauses aligned, and discuss the features of this corpus.

A multi-language translation example browser
Isao Goto | Naoto Kato | Noriyoshi Uratani | Terumasa Ehara | Tadashi Kumano | Hideki Tanaka
Proceedings of Machine Translation Summit IX: System Presentations

This paper describes a Multi-language Translation Example Browser, a type of translation memory system. The system is able to retrieve translation examples from bilingual news databases, which consist of news transcripts of past broadcasts. We put a Japanese-English system to practical use and undertook trial operations of a system of eight language-pairs.

2002

Automatic Alignment of Japanese and English Newspaper Articles using an MT System and a Bilingual Company Name Dictionary
Kenji Matsumoto | Hideki Tanaka
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

2001

ATR-SLT System for SENSEVAL-2 Japanese Translation Task
Tadashi Kumano | Hideki Kashioka | Hideki Tanaka
Proceedings of SENSEVAL-2 Second International Workshop on Evaluating Word Sense Disambiguation Systems

1999

An Efficient Statistical Speech Act Type Tagging System for Speech Translation Systems
Hideki Tanaka | Akio Yokoo
Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics

1998

Context Management with Topics for Spoken Dialogue Systems
Kristiina Jokinen | Hideki Tanaka | Akio Yokoo
COLING 1998 Volume 1: The 17th International Conference on Computational Linguistics

Planning Dialogue Contributions With New Information
Kristiina Jokinen | Hideki Tanaka | Akio Yokoo
Natural Language Generation

Context Management with Topics for Spoken Dialogue Systems
Kristiina Jokinen | Hideki Tanaka | Akio Yokoo
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1

1996

Decision Tree Learning Algorithm with Structured Attributes: Application to Verbal Case Frame Acquisition
Hideki Tanaka
COLING 1996 Volume 2: The 16th International Conference on Computational Linguistics

1994

Verbal Case Frame Acquisition From a Bilingual Corpus: Gradual Knowledge Acquisition
Hideki Tanaka
COLING 1994 Volume 2: The 15th International Conference on Computational Linguistics

1992

A Method of Translating English Delexical Structures Into Japanese
Hideki Tanaka | Teruaki Aizawa | Yeun-Bae Kim | Nobuko Hatada
COLING 1992 Volume 2: The 14th International Conference on Computational Linguistics

1990

A Machine Translation System for Foreign News in Satellite Broadcasting
Teruaki Aizawa | Terumasa Ehara | Noriyoshi Uratani | Hideki Tanaka | Naoto Kato | Sumio Nakase | Norikazu Aruga | Takeo Matsuda
COLING 1990 Volume 3: Papers presented to the 13th International Conference on Computational Linguistics