Masaru Kitsuregawa


2022

Building Large-Scale Japanese Pronunciation-Annotated Corpora for Reading Heteronymous Logograms
Fumikazu Sato | Naoki Yoshinaga | Masaru Kitsuregawa
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Although screen readers enable visually impaired people to read written text via speech, ambiguities in the pronunciations of heteronyms cause incorrect readings, which seriously impairs text understanding. In Japanese especially, there are many common heteronyms expressed by logograms (Chinese characters, or kanji) that have totally different pronunciations (and meanings). In this study, to improve the accuracy of pronunciation prediction, we construct two large-scale Japanese corpora that annotate kanji characters with their pronunciations. Using existing language resources on i) book titles compiled by the National Diet Library and ii) the books in a Japanese digital library called Aozora Bunko and their Braille translations, we develop two large-scale pronunciation-annotated corpora for training pronunciation prediction models. We first extract sentence-level alignments between the Aozora Bunko text and its pronunciation converted from the Braille data. We then perform dictionary-based pattern matching based on morphological dictionaries to find word-level pronunciation alignments. We ultimately obtained the Book Title corpus with 336M characters (16.4M book titles) and the Aozora Bunko corpus with 52M characters (1.6M sentences). We analyzed pronunciation distributions for 203 common heteronyms and trained a BERT-based pronunciation prediction model for 93 heteronyms, which achieved an average accuracy of 0.939.
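The word-level alignment step described in this abstract can be illustrated with a minimal sketch: given a segmented sentence and its full kana pronunciation (recovered from Braille), match each word's candidate readings from a morphological dictionary against the pronunciation string. The dictionary entries and the greedy first-match strategy below are invented for illustration; a real system would use much larger dictionaries and allow backtracking.

```python
# Hypothetical reading dictionary; 今日 is a heteronym with two readings.
READINGS = {
    "今日": ["キョウ", "コンニチ"],
    "は": ["ワ", "ハ"],
    "晴れ": ["ハレ"],
}

def align(words, pron):
    """Return (word, reading) pairs whose readings concatenate to `pron`,
    or None if no candidate reading fits at some position."""
    aligned, pos = [], 0
    for w in words:
        for r in READINGS.get(w, []):
            if pron.startswith(r, pos):  # reading matches at current offset
                aligned.append((w, r))
                pos += len(r)
                break
        else:
            return None  # no candidate reading matches here
    return aligned if pos == len(pron) else None

pairs = align(["今日", "は", "晴れ"], "キョウワハレ")
```

Because the matcher is greedy, an earlier candidate that happens to match can block the correct later one; dynamic programming over all candidate readings would avoid this.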

2021

Speculative Sampling in Variational Autoencoders for Dialogue Response Generation
Shoetsu Sato | Naoki Yoshinaga | Masashi Toyoda | Masaru Kitsuregawa
Findings of the Association for Computational Linguistics: EMNLP 2021

Variational autoencoders have been studied as a promising approach to model one-to-many mappings from context to response in chat response generation. However, they often fail to learn proper mappings. One of the reasons for this failure is the discrepancy between a response and a latent variable sampled from an approximated distribution during training. Inappropriately sampled latent variables hinder models from constructing a modulated latent space. As a result, the models stop handling uncertainty in conversations. To resolve this, we propose speculative sampling of latent variables. Our method chooses the most probable of redundantly sampled latent variables to tie the variable to a given response. We confirm the efficacy of our method in response generation with massive dialogue data constructed from Twitter posts.
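The core idea of speculative sampling can be sketched as follows: draw several latent variables from the approximate posterior and keep the one under which the reference response is most probable. The decoder below is a stand-in scoring function (negative squared distance), and all dimensions are invented; this is not the paper's model, only an illustration of the selection step.

```python
import numpy as np

rng = np.random.default_rng(0)

def decoder_loglik(z, response_vec):
    # Stand-in for the decoder's log-likelihood of the response given z.
    return -float(np.sum((z - response_vec) ** 2))

def speculative_sample(mu, sigma, response_vec, k=10):
    """Redundantly sample k latents from N(mu, sigma^2) and return the one
    that best explains the response, along with all k scores."""
    zs = rng.normal(mu, sigma, size=(k, len(mu)))
    scores = [decoder_loglik(z, response_vec) for z in zs]
    return zs[int(np.argmax(scores))], scores

mu, response = np.zeros(4), np.ones(4)
best_z, scores = speculative_sample(mu, 1.0, response, k=32)
```

With k=1 this reduces to ordinary VAE sampling; larger k makes the selected latent more consistent with the paired response during training.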

2020

Vocabulary Adaptation for Domain Adaptation in Neural Machine Translation
Shoetsu Sato | Jin Sakuma | Naoki Yoshinaga | Masashi Toyoda | Masaru Kitsuregawa
Findings of the Association for Computational Linguistics: EMNLP 2020

Neural network methods exhibit strong performance only in a few resource-rich domains. Practitioners therefore employ domain adaptation from resource-rich domains that are, in most cases, distant from the target domain. Domain adaptation between distant domains (e.g., movie subtitles and research papers), however, cannot be performed effectively due to mismatches in vocabulary; the model will encounter many domain-specific words (e.g., “angstrom”) and words whose meanings shift across domains (e.g., “conductor”). In this study, aiming to solve these vocabulary mismatches in domain adaptation for neural machine translation (NMT), we propose vocabulary adaptation, a simple method for effective fine-tuning that adapts the embedding layers in a given pretrained NMT model to the target domain. Prior to fine-tuning, our method replaces the embedding layers of the NMT model by projecting general word embeddings induced from monolingual data in the target domain onto the source-domain embedding space. Experimental results indicate that our method improves the performance of conventional fine-tuning by 3.86 and 3.28 BLEU points in En-Ja and De-En translation, respectively.
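The projection step in this abstract amounts to learning a linear map between two embedding spaces from words shared by both domains, then applying it to the remaining target-domain words. A minimal least-squares sketch (with invented toy embeddings constructed so the two spaces are exactly linearly related) could look like:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: 4-dim embeddings; source-space vectors are (by construction)
# a linear transform of the target-domain vectors.
true_map = rng.normal(size=(4, 4))
tgt_emb = rng.normal(size=(6, 4))   # 6 target-domain word vectors
src_emb = tgt_emb @ true_map        # their source-space counterparts

# Learn the projection from the first five "anchor" words shared across
# domains, then project the held-out sixth word into the source space.
anchors = slice(0, 5)
proj, *_ = np.linalg.lstsq(tgt_emb[anchors], src_emb[anchors], rcond=None)
projected = tgt_emb[5] @ proj
```

In practice the shared vocabulary gives many noisy anchor pairs rather than an exact linear relation, so the least-squares fit is only approximate.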

Robust Backed-off Estimation of Out-of-Vocabulary Embeddings
Nobukazu Fukuda | Naoki Yoshinaga | Masaru Kitsuregawa
Findings of the Association for Computational Linguistics: EMNLP 2020

Out-of-vocabulary (oov) words cause serious trouble when solving natural language tasks with a neural network. Existing approaches to this problem resort to using subwords, which are shorter and more ambiguous units than words, in order to represent oov words with a bag of subwords. In this study, inspired by the processes for creating words from known words, we propose a robust method of estimating oov word embeddings by referring to pre-trained word embeddings for known words whose surface forms are similar to the target oov words. We collect known words by segmenting oov words and by approximate string matching, and we then aggregate their pre-trained embeddings. Experimental results show that the obtained oov word embeddings improve not only word similarity tasks but also downstream tasks in Twitter and biomedical domains where oov words often appear, even when the computed oov embeddings are integrated into a bert-based strong baseline.
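The backed-off estimation described above can be sketched as a similarity-weighted average over known words with similar surface forms. The tiny embedding table and the use of `difflib` for approximate string matching are illustrative assumptions, not the paper's actual resources or matcher.

```python
import difflib

# Invented 2-dim pre-trained embeddings; a real table has many thousands.
EMB = {
    "overfit": [0.9, 0.1],
    "overfitting": [0.8, 0.2],
    "fitness": [0.1, 0.9],
}

def oov_embedding(oov, vocab_emb, cutoff=0.6, dim=2):
    """Back off to known words whose surfaces resemble `oov` and average
    their embeddings, weighted by string similarity."""
    total, acc = 0.0, [0.0] * dim
    for word, vec in vocab_emb.items():
        sim = difflib.SequenceMatcher(None, oov, word).ratio()
        if sim >= cutoff:  # keep only sufficiently similar known words
            total += sim
            acc = [a + sim * x for a, x in zip(acc, vec)]
    return [a / total for a in acc] if total > 0.0 else None

vec = oov_embedding("overfits", EMB)
```

Here “overfits” backs off to “overfit” and “overfitting” (dissimilar “fitness” falls below the cutoff), so the estimate lands between their embeddings.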

A System for Worldwide COVID-19 Information Aggregation
Akiko Aizawa | Frederic Bergeron | Junjie Chen | Fei Cheng | Katsuhiko Hayashi | Kentaro Inui | Hiroyoshi Ito | Daisuke Kawahara | Masaru Kitsuregawa | Hirokazu Kiyomaru | Masaki Kobayashi | Takashi Kodama | Sadao Kurohashi | Qianying Liu | Masaki Matsubara | Yusuke Miyao | Atsuyuki Morishima | Yugo Murawaki | Kazumasa Omura | Haiyue Song | Eiichiro Sumita | Shinji Suzuki | Ribeka Tanaka | Yu Tanaka | Masashi Toyoda | Nobuhiro Ueda | Honai Ueoka | Masao Utiyama | Ying Zhong
Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020

The global pandemic of COVID-19 has made the public pay close attention to related news covering various domains, such as sanitation, treatment, and effects on education. Meanwhile, the COVID-19 situation differs greatly among countries (e.g., in policies and the development of the epidemic), so citizens are also interested in news from foreign countries. We build a system for worldwide COVID-19 information aggregation containing reliable articles from 10 regions in 7 languages, sorted by topic. Our dataset of reliable COVID-19-related websites, collected through crowdsourcing, ensures the quality of the articles. A neural machine translation module translates articles in other languages into Japanese and English. A BERT-based topic classifier trained on our article-topic pair dataset helps users efficiently find the information they are interested in by sorting articles into categories.

2019

Learning to Describe Unknown Phrases with Local and Global Contexts
Shonosuke Ishiwatari | Hiroaki Hayashi | Naoki Yoshinaga | Graham Neubig | Shoetsu Sato | Masashi Toyoda | Masaru Kitsuregawa
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

When reading a text, it is common to become stuck on unfamiliar words and phrases, such as polysemous words with novel senses, rarely used idioms, internet slang, or emerging entities. If we humans cannot figure out the meaning of those expressions from the immediate local context, we consult dictionaries for definitions or search documents or the web to find other global context to help in interpretation. Can machines help us do this work? Which type of context is more important for machines to solve the problem? To answer these questions, we undertake the task of describing a given phrase in natural language based on its local and global contexts. To solve this task, we propose a neural description model that consists of two context encoders and a description decoder. In contrast to the existing methods for non-standard English explanation [Ni+ 2017] and definition generation [Noraset+ 2017; Gadetsky+ 2018], our model appropriately takes important clues from both local and global contexts. Experimental results on three existing datasets (including WordNet, Oxford and Urban Dictionaries) and a dataset newly created from Wikipedia demonstrate the effectiveness of our method over previous work.
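One simple way to combine two context encodings, sketched below under assumed dimensions, is a learned scalar gate that mixes the local and global context vectors before decoding. This is only an illustrative fusion mechanism, not necessarily the exact architecture of the paper's model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse(local_vec, global_vec, gate_w):
    """Mix two context encodings with a gate in (0, 1) computed from both,
    so the decoder can lean on whichever context is more informative."""
    gate = sigmoid(gate_w @ np.concatenate([local_vec, global_vec]))
    return gate * local_vec + (1.0 - gate) * global_vec

rng = np.random.default_rng(0)
local_vec, global_vec = rng.normal(size=4), rng.normal(size=4)
gate_w = rng.normal(size=8)  # invented gate parameters
fused = fuse(local_vec, global_vec, gate_w)
```

Because the gate is a scalar in (0, 1), the fused vector is a convex combination of the two encodings, elementwise between them.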

2017

Chunk-based Decoder for Neural Machine Translation
Shonosuke Ishiwatari | Jingtao Yao | Shujie Liu | Mu Li | Ming Zhou | Naoki Yoshinaga | Masaru Kitsuregawa | Weijia Jia
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Chunks (or phrases) once played a pivotal role in machine translation. By using a chunk rather than a word as the basic translation unit, local (intra-chunk) and global (inter-chunk) word orders and dependencies can be easily modeled. The chunk structure, despite its importance, has not been considered in the decoders used for neural machine translation (NMT). In this paper, we propose chunk-based decoders for NMT, each of which consists of a chunk-level decoder and a word-level decoder. The chunk-level decoder models global dependencies while the word-level decoder decides the local word order in a chunk. To output a target sentence, the chunk-level decoder generates a chunk representation containing global information, which the word-level decoder then uses as a basis to predict the words inside the chunk. Experimental results show that our proposed decoders can significantly improve translation performance in a WAT ‘16 English-to-Japanese translation task.
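The two-level generation process described above can be caricatured as: first emit a chunk sequence, then expand each chunk into words. The hard-coded lookup tables below stand in for the chunk-level and word-level decoders; everything here is invented to show only the control flow.

```python
# Stand-ins for the two decoders (real models are recurrent networks).
CHUNK_SEQ = ["NP", "VP"]            # output of the chunk-level decoder
WORDS_IN_CHUNK = {                  # output of the word-level decoder
    "NP": ["the", "cat"],
    "VP": ["sat", "down"],
}

def decode(chunk_decoder, word_decoder):
    """Generate a chunk sequence, then expand each chunk to its words."""
    sentence = []
    for chunk in chunk_decoder:
        sentence.extend(word_decoder[chunk])
    return sentence

out = decode(CHUNK_SEQ, WORDS_IN_CHUNK)
```

The point of the hierarchy is that intra-chunk word order is decided locally, conditioned on a chunk representation that carries the global, inter-chunk context.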

Modeling Situations in Neural Chat Bots
Shoetsu Sato | Naoki Yoshinaga | Masashi Toyoda | Masaru Kitsuregawa
Proceedings of ACL 2017, Student Research Workshop

2016

Kotonush: Understanding Concepts Based on Values behind Social Media
Tatsuya Iwanari | Kohei Ohara | Naoki Yoshinaga | Nobuhiro Kaji | Masashi Toyoda | Masaru Kitsuregawa
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

Kotonush, a system that clarifies people’s values on various concepts on the basis of what they write about on social media, is presented. The values are represented by ordering sets of concepts (e.g., London, Berlin, and Rome) in accordance with a common attribute intensity expressed by an adjective (e.g., entertaining). We exploit social media text written by different demographics and at different times in order to induce specific orderings for comparison. The system combines a text-to-ordering module with an interactive querying interface enabled by massive hyponymy relations and provides mechanisms to compare the induced orderings from various viewpoints. We empirically evaluate Kotonush and present some case studies, featuring real-world concept orderings with different domains on Twitter, to demonstrate the usefulness of our system.

2015

Accurate Cross-lingual Projection between Count-based Word Vectors by Exploiting Translatable Context Pairs
Shonosuke Ishiwatari | Nobuhiro Kaji | Naoki Yoshinaga | Masashi Toyoda | Masaru Kitsuregawa
Proceedings of the Nineteenth Conference on Computational Natural Language Learning

2014

A Self-adaptive Classifier for Efficient Text-stream Processing
Naoki Yoshinaga | Masaru Kitsuregawa
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

Accurate Word Segmentation and POS Tagging for Japanese Microblogs: Corpus Annotation and Joint Modeling with Lexical Normalization
Nobuhiro Kaji | Masaru Kitsuregawa
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

2013

Collective Sentiment Classification Based on User Leniency and Product Popularity
Wenliang Gao | Naoki Yoshinaga | Nobuhiro Kaji | Masaru Kitsuregawa
Proceedings of the 27th Pacific Asia Conference on Language, Information, and Computation (PACLIC 27)

Efficient Word Lattice Generation for Joint Word Segmentation and POS Tagging in Japanese
Nobuhiro Kaji | Masaru Kitsuregawa
Proceedings of the Sixth International Joint Conference on Natural Language Processing

Modeling User Leniency and Product Popularity for Sentiment Classification
Wenliang Gao | Naoki Yoshinaga | Nobuhiro Kaji | Masaru Kitsuregawa
Proceedings of the Sixth International Joint Conference on Natural Language Processing

2011

Sentiment Classification in Resource-Scarce Languages by using Label Propagation
Yong Ren | Nobuhiro Kaji | Naoki Yoshinaga | Masashi Toyoda | Masaru Kitsuregawa
Proceedings of the 25th Pacific Asia Conference on Language, Information and Computation

Splitting Noun Compounds via Monolingual and Bilingual Paraphrasing: A Study on Japanese Katakana Words
Nobuhiro Kaji | Masaru Kitsuregawa
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

2010

Kernel Slicing: Scalable Online Training with Conjunctive Features
Naoki Yoshinaga | Masaru Kitsuregawa
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

Efficient Staggered Decoding for Sequence Labeling
Nobuhiro Kaji | Yasuhiro Fujiwara | Naoki Yoshinaga | Masaru Kitsuregawa
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

2009

A Combination of Active Learning and Semi-supervised Learning Starting with Positive and Unlabeled Examples for Word Sense Disambiguation: An Empirical Study on Japanese Web Search Query
Makoto Imamura | Yasuhiro Takayama | Nobuhiro Kaji | Masashi Toyoda | Masaru Kitsuregawa
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

Polynomial to Linear: Efficient Classification with Conjunctive Features
Naoki Yoshinaga | Masaru Kitsuregawa
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

2008

Using Hidden Markov Random Fields to Combine Distributional and Pattern-Based Word Clustering
Nobuhiro Kaji | Masaru Kitsuregawa
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

2007

Building Lexicon for Sentiment Analysis from Massive Collection of HTML Documents
Nobuhiro Kaji | Masaru Kitsuregawa
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

2006

Automatic Construction of Polarity-Tagged Corpus from HTML Documents
Nobuhiro Kaji | Masaru Kitsuregawa
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions