Hidetoshi Shimodaira


Segmentation-free compositional n-gram embedding
Geewook Kim | Kazuki Fukui | Hidetoshi Shimodaira
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

We propose a new type of representation learning method that models words, phrases and sentences seamlessly. Our method does not depend on word segmentation and any human-annotated resources (e.g., word dictionaries), yet it is very effective for noisy corpora written in unsegmented languages such as Chinese and Japanese. The main idea of our method is to ignore word boundaries completely (i.e., segmentation-free), and construct representations for all character n-grams in a raw corpus with embeddings of compositional sub-n-grams. Although the idea is simple, our experiments on various benchmarks and real-world datasets show the efficacy of our proposal.


Word-like character n-gram embedding
Geewook Kim | Kazuki Fukui | Hidetoshi Shimodaira
Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text

We propose a new word embedding method called word-like character n-gram embedding, which learns distributed representations of words by embedding word-like character n-grams. Our method is an extension of recently proposed segmentation-free word embedding, which directly embeds frequent character n-grams from a raw corpus. However, its n-gram vocabulary tends to contain too many non-word n-grams. We solved this problem by introducing an idea of expected word frequency. Compared to the previously proposed methods, our method can embed more words, along with the words that are not included in a given basic word dictionary. Since our method does not rely on word segmentation with rich word dictionaries, it is especially effective when the text in the corpus is in unsegmented language and contains many neologisms and informal words (e.g., Chinese SNS dataset). Our experimental results on Sina Weibo (a Chinese microblog service) and Twitter show that the proposed method can embed more words and improve the performance of downstream tasks.


Spectral Graph-Based Method of Multimodal Word Embedding
Kazuki Fukui | Takamasa Oshikiri | Hidetoshi Shimodaira
Proceedings of TextGraphs-11: the Workshop on Graph-based Methods for Natural Language Processing

In this paper, we propose a novel method for multimodal word embedding, which exploit a generalized framework of multi-view spectral graph embedding to take into account visual appearances or scenes denoted by words in a corpus. We evaluated our method through word similarity tasks and a concept-to-image search task, having found that it provides word representations that reflect visual information, while somewhat trading-off the performance on the word similarity tasks. Moreover, we demonstrate that our method captures multimodal linguistic regularities, which enable recovering relational similarities between words and images by vector arithmetics.


Cross-Lingual Word Representations via Spectral Graph Embeddings
Takamasa Oshikiri | Kazuki Fukui | Hidetoshi Shimodaira
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)