fugashi, a Tool for Tokenizing Japanese in Python

Paul McCann

doi:10.18653/v1/2020.nlposs-1.7

Abstract

Recent years have seen an increase in the number of large-scale multilingual NLP projects. However, even in such projects, languages with special processing requirements are often excluded. One such language is Japanese. Japanese is written without spaces, tokenization is non-trivial, and while high-quality open source tokenizers exist, they can be hard to use and lack English documentation. This paper introduces fugashi, a MeCab wrapper for Python, and gives an introduction to tokenizing Japanese.

Anthology ID: 2020.nlposs-1.7
Volume: Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)
Month: November
Year: 2020
Address: Online
Venue: NLPOSS
Publisher: Association for Computational Linguistics
Pages: 44–51
URL: https://aclanthology.org/2020.nlposs-1.7
DOI: 10.18653/v1/2020.nlposs-1.7
Cite (ACL): Paul McCann. 2020. fugashi, a Tool for Tokenizing Japanese in Python. In Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS), pages 44–51, Online. Association for Computational Linguistics.
Cite (Informal): fugashi, a Tool for Tokenizing Japanese in Python (McCann, NLPOSS 2020)
PDF: https://preview.aclanthology.org/paclic-22-ingestion/2020.nlposs-1.7.pdf
Video: https://slideslive.com/38939744
Code: polm/fugashi
