Taku Kudo

Also published as: Taku Kudoh

2018

SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
Taku Kudo | John Richardson
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

This paper describes SentencePiece, a language-independent subword tokenizer and detokenizer designed for neural text processing, including neural machine translation. It provides open-source C++ and Python implementations for subword units. While existing subword segmentation tools assume that the input is pre-tokenized into word sequences, SentencePiece can train subword models directly from raw sentences, which allows us to build a purely end-to-end and language-independent system. We perform a validation experiment of NMT on English-Japanese machine translation and find that subword training directly from raw sentences achieves accuracy comparable to the conventional pre-tokenized pipeline. We also compare the performance of subword training and segmentation under various configurations. SentencePiece is available under the Apache 2 license at https://github.com/google/sentencepiece.
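
As the abstract points to the open-source Python bindings, a minimal usage sketch may be helpful. It is not taken from the paper; the file name corpus.txt and the vocabulary size are illustrative assumptions.

```python
# Minimal sketch: train a SentencePiece model from raw (untokenized) text and
# use it to segment and restore sentences. File names and sizes are illustrative.
import sentencepiece as spm

# Train a subword model directly from raw sentences, one per line.
spm.SentencePieceTrainer.train(
    input="corpus.txt",      # raw text; no pre-tokenization required
    model_prefix="m",        # writes m.model and m.vocab
    vocab_size=8000,
    model_type="unigram",    # "bpe" is also supported
)

# Load the trained model and segment / detokenize text.
sp = spm.SentencePieceProcessor(model_file="m.model")
pieces = sp.encode("Hello world.", out_type=str)
print(pieces)                # e.g. ['▁He', 'llo', '▁world', '.']
print(sp.decode(pieces))     # restores "Hello world."
```

Because whitespace is encoded as part of the subword pieces, decoding restores the original text, which is what makes the tokenize/detokenize round trip lossless and language-independent.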

Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates
Taku Kudo
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Subword units are an effective way to alleviate the open-vocabulary problem in neural machine translation (NMT). While sentences are usually converted into unique subword sequences, subword segmentation is potentially ambiguous, and multiple segmentations are possible even with the same vocabulary. The question addressed in this paper is whether it is possible to harness this segmentation ambiguity as noise to improve the robustness of NMT. We present a simple regularization method, subword regularization, which trains the model with multiple subword segmentations probabilistically sampled during training. In addition, for better subword sampling, we propose a new subword segmentation algorithm based on a unigram language model. We experiment with multiple corpora and report consistent improvements, especially in low-resource and out-of-domain settings.
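
The segmentation sampling described here is exposed in the open-source SentencePiece library; below is a minimal sketch, assuming the unigram model m.model trained in the previous example, with illustrative alpha and nbest_size values.

```python
# Minimal sketch: sample multiple subword segmentations of the same sentence
# from a unigram SentencePiece model (subword regularization at training time).
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="m.model")

# With sampling enabled, each call may return a different segmentation drawn
# from the unigram language model; feeding such samples to an NMT model exposes
# it to many segmentations of the same sentence.
for _ in range(3):
    print(sp.encode("subword regularization", out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))
```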

2016

Phrase-based Machine Translation using Multiple Preordering Candidates
Yusuke Oda | Taku Kudo | Tetsuji Nakagawa | Taro Watanabe
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

In this paper, we propose a new decoding method for phrase-based statistical machine translation that directly uses multiple preordering candidates as a graph structure. Compared with previous phrase-based decoding methods, our method is based on simple left-to-right dynamic programming in which no decoding-time reordering is performed. As a result, it runs very fast and the algorithm is easy to implement. Our system does not depend on a specific preordering method as long as it outputs multiple preordering candidates, and it is trivial to plug existing preordering methods into our system. In experiments translating 11 diverse languages into English, the proposed method outperforms a conventional phrase-based decoder in translation quality with comparable or faster decoding time.
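
As a rough illustration of the decoding idea (not the authors' implementation), the sketch below merges two hypothetical preordering candidates of one source sentence into a small lattice and finds the cheapest monotone path by left-to-right dynamic programming; the lattice, phrase table, and costs are invented for illustration, and no language model is used.

```python
# Rough sketch: decoding over a preordering lattice. Nodes are lattice states,
# edges carry already-preordered source segments, and the decoder picks the
# cheapest left-to-right path, translating each edge monotonically.
import heapq

def decode(lattice, start, goal, phrase_table):
    """Cheapest-path search over a DAG of preordered source segments."""
    heap = [(0.0, start, ())]          # (cost so far, node, target words)
    settled = set()
    while heap:
        cost, node, out = heapq.heappop(heap)
        if node == goal:
            return cost, list(out)
        if node in settled:
            continue
        settled.add(node)
        for nxt, src in lattice.get(node, []):
            tgt, c = phrase_table[src]
            heapq.heappush(heap, (cost + c, nxt, out + (tgt,)))
    return float("inf"), []

# Two preordering candidates ("kare wa ..." vs. "wa kare ...") share the start
# node 0 and goal node 3, so the decoder picks the ordering that translates
# more cheaply, with no reordering performed at decoding time.
lattice = {
    0: [(1, "kare wa"), (2, "wa kare")],
    1: [(3, "hashitta")],
    2: [(3, "hashitta")],
}
phrase_table = {
    "kare wa": ("he", 0.2),
    "wa kare": ("he", 0.9),
    "hashitta": ("ran", 0.3),
}
print(decode(lattice, 0, 3, phrase_table))   # -> (0.5, ['he', 'ran'])
```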

2014

A joint inference of deep case analysis and zero subject generation for Japanese-to-English statistical machine translation
Taku Kudo | Hiroshi Ichikawa | Hideto Kazawa
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2011

Proceedings of the Workshop on Advances in Text Input Methods (WTIM 2011)
Hideto Kazawa | Hisami Suzuki | Taku Kudo
Proceedings of the Workshop on Advances in Text Input Methods (WTIM 2011)

Efficient dictionary and language model compression for input method editors
Taku Kudo | Toshiyuki Hanaoka | Jun Mukai | Yusuke Tabata | Hiroyuki Komatsu
Proceedings of the Workshop on Advances in Text Input Methods (WTIM 2011)

2008

Graph-based Analysis of Semantic Drift in Espresso-like Bootstrapping Algorithms
Mamoru Komachi | Taku Kudo | Masashi Shimbo | Yuji Matsumoto
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

2005

Boosting-based Parse Reranking with Subtree Features
Taku Kudo | Jun Suzuki | Hideki Isozaki
Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05)

2004

Applying Conditional Random Fields to Japanese Morphological Analysis
Taku Kudo | Kaoru Yamamoto | Yuji Matsumoto
Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing

A Boosting Algorithm for Classification of Semi-Structured Text
Taku Kudo | Yuji Matsumoto
Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing

2003

Fast Methods for Kernel-Based Text Analysis
Taku Kudo | Yuji Matsumoto
Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics

Learning Sequence-to-Sequence Correspondences from Parallel Corpora via Sequential Pattern Mining
Kaoru Yamamoto | Taku Kudo | Yuta Tsuboi | Yuji Matsumoto
Proceedings of the HLT-NAACL 2003 Workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond

Protein Name Tagging for Biomedical Annotation in Text
Kaoru Yamamoto | Taku Kudo | Akihiko Konagaya | Yuji Matsumoto
Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine

2002

Revision Learning and its Application to Part-of-Speech Tagging
Tetsuji Nakagawa | Taku Kudo | Yuji Matsumoto
Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics

Japanese Dependency Analysis using Cascaded Chunking
Taku Kudo | Yuji Matsumoto
COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002)

2001

Chunking with Support Vector Machines
Taku Kudo | Yuji Matsumoto
Second Meeting of the North American Chapter of the Association for Computational Linguistics

2000

Use of Support Vector Learning for Chunk Identification
Taku Kudoh | Yuji Matsumoto
Fourth Conference on Computational Natural Language Learning and the Second Learning Language in Logic Workshop

Japanese Dependency Structure Analysis Based on Support Vector Machines
Taku Kudo | Yuji Matsumoto
2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora