John Richardson

2018

SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
Taku Kudo | John Richardson
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

This paper describes SentencePiece, a language-independent subword tokenizer and detokenizer designed for Neural-based text processing, including Neural Machine Translation. It provides open-source C++ and Python implementations for subword units. While existing subword segmentation tools assume that the input is pre-tokenized into word sequences, SentencePiece can train subword models directly from raw sentences, which allows us to make a purely end-to-end and language independent system. We perform a validation experiment of NMT on English-Japanese machine translation, and find that it is possible to achieve comparable accuracy to direct subword training from raw sentences. We also compare the performance of subword training and segmentation with various configurations. SentencePiece is available under the Apache 2 license at https://github.com/google/sentencepiece.

2016

pdf bib

Insertion Position Selection Model for Flexible Non-Terminals in Dependency Tree-to-Tree Machine Translation
Toshiaki Nakazawa | John Richardson | Sadao Kurohashi
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf bib abs

In recent years there has been a surge of interest in the natural language prosessing related to the real world, such as symbol grounding, language generation, and nonlinguistic data search by natural language queries. In order to concentrate on language ambiguities, we propose to use a well-defined “real world,” that is game states. We built a corpus consisting of pairs of sentences and a game state. The game we focus on is shogi (Japanese chess). We collected 742,286 commentary sentences in Japanese. They are spontaneously generated contrary to natural language annotations in many image datasets provided by human workers on Amazon Mechanical Turk. We defined domain specific named entities and we segmented 2,508 sentences into words manually and annotated each word with a named entity tag. We describe a detailed definition of named entities and show some statistics of our game commentary corpus. We also show the results of the experiments of word segmentation and named entity recognition. The accuracies are as high as those on general domain texts indicating that we are ready to tackle various new problems related to the real world.

pdf bib

Flexible Non-Terminals for Dependency Tree-to-Tree Reordering
John Richardson | Fabien Cromières | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2015

pdf bib

Enhancing function word translation with syntax-based statistical post-editing
John Richardson | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the 6th Workshop on Patent and Scientific Literature Translation

pdf bib

pdf bib

Pivot-Based Topic Models for Low-Resource Lexicon Extraction
John Richardson | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation

2014

pdf bib abs

Bilingual Dictionary Construction with Transliteration Filtering
John Richardson | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper we present a bilingual transliteration lexicon of 170K Japanese-English technical terms in the scientific domain. Translation pairs are extracted by filtering a large list of transliteration candidates generated automatically from a phrase table trained on parallel corpora. Filtering uses a novel transliteration similarity measure based on a discriminative phrase-based machine translation approach. We demonstrate that the extracted dictionary is accurate and of high recall (F1 score 0.8). Our lexicon contains not only single words but also multi-word expressions, and is freely available. Our experiments focus on Katakana-English lexicon construction, however it would be possible to apply the proposed methods to transliteration extraction for a variety of language pairs.

pdf bib

KyotoEBMT: An Example-Based Dependency-to-Dependency Translation Framework
John Richardson | Fabien Cromières | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations

pdf bib

KyotoEBMT System Description for the 1st Workshop on Asian Translation
John Richardson | Fabien Cromières | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the 1st Workshop on Asian Translation (WAT2014)