Abstract
Recent years have seen an increase in the number of large-scale multilingual NLP projects. However, even in such projects, languages with special processing requirements are often excluded. One such language is Japanese. Japanese is written without spaces, tokenization is non-trivial, and while high quality open source tokenizers exist they can be hard to use and lack English documentation. This paper introduces fugashi, a MeCab wrapper for Python, and gives an introduction to tokenizing Japanese.

- Anthology ID:
- 2020.nlposs-1.7
- Volume:
- Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)
- Month:
- November
- Year:
- 2020
- Address:
- Online
- Venue:
- NLPOSS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 44–51
- URL:
- https://aclanthology.org/2020.nlposs-1.7
- DOI:
- 10.18653/v1/2020.nlposs-1.7
- Cite (ACL):
- Paul McCann. 2020. fugashi, a Tool for Tokenizing Japanese in Python. In Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS), pages 44–51, Online. Association for Computational Linguistics.
- Cite (Informal):
- fugashi, a Tool for Tokenizing Japanese in Python (McCann, NLPOSS 2020)
- PDF:
- https://preview.aclanthology.org/paclic-22-ingestion/2020.nlposs-1.7.pdf
- Code:
- polm/fugashi