Kazuya Kawahara


2016

pdf
BonTen’ – Corpus Concordance System for ‘NINJAL Web Japanese Corpus’
Masayuki Asahara | Kazuya Kawahara | Yuya Takei | Hideto Masuoka | Yasuko Ohba | Yuki Torii | Toru Morii | Yuki Tanaka | Kikuo Maekawa | Sachi Kato | Hikari Konishi
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

The National Institute for Japanese Language and Linguistics, Japan (NINJAL) has undertaken a corpus compilation project to construct a web corpus for linguistic research comprising ten billion words. The project is divided into four parts: page collection, linguistic analysis, development of the corpus concordance system, and preservation. This article presents the corpus concordance system named ‘BonTen’ which enables the ten-billion-scaled corpus to be queried by string, a sequence of morphological information or a subtree of the syntactic dependency structure.