Joint Persian Word Segmentation Correction and Zero-Width Non-Joiner Recognition Using BERT

Ehsan Doostmohammadi, Minoo Nassajian, Adel Rahimi


Abstract
Words are properly segmented in the Persian writing system; in practice, however, these writing rules are often neglected, resulting in single words being written disjointedly and multiple words written without any white spaces between them. This paper addresses the problems of word segmentation and zero-width non-joiner (ZWNJ) recognition in Persian, which we approach jointly as a sequence labeling problem. We achieved a macro-averaged F1-score of 92.40% on a carefully collected corpus of 500 sentences with a high level of difficulty.
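The sequence-labeling framing mentioned in the abstract can be illustrated with a minimal sketch. It assumes a character-level scheme in which every non-separator character is tagged with what should follow it: nothing, a space, or a ZWNJ. The label names (`NONE`, `SPACE`, `ZWNJ`) and the helpers `to_labels`, `from_labels`, and `macro_f1` are hypothetical and are not taken from the paper. The BERT model that would predict the labels is not shown here.

```python
ZWNJ = "\u200c"  # zero-width non-joiner (U+200C)
SPACE = " "

def to_labels(text):
    """Drop spaces and ZWNJs; tag each remaining character with what follows it.

    Assumed (hypothetical) label set: NONE, SPACE, ZWNJ.
    """
    chars, labels = [], []
    for ch in text:
        if ch == SPACE:
            if labels:
                labels[-1] = "SPACE"
        elif ch == ZWNJ:
            if labels:
                labels[-1] = "ZWNJ"
        else:
            chars.append(ch)
            labels.append("NONE")
    return chars, labels

def from_labels(chars, labels):
    """Rebuild text by re-inserting the separator named by each label."""
    out = []
    for ch, label in zip(chars, labels):
        out.append(ch)
        if label == "SPACE":
            out.append(SPACE)
        elif label == "ZWNJ":
            out.append(ZWNJ)
    return "".join(out)

def macro_f1(gold, pred, classes=("NONE", "SPACE", "ZWNJ")):
    """Unweighted mean of per-class F1 over the assumed label set."""
    scores = []
    for c in classes:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Example: "می‌روم به خانه" with the ZWNJ written explicitly.
text = "می" + ZWNJ + "روم به خانه"
chars, labels = to_labels(text)
assert from_labels(chars, labels) == text  # labels fully encode the separators
```

Under this framing, a badly segmented input is stripped of its separators, a model predicts one label per character, and `from_labels` rebuilds the corrected text. Macro-averaging gives the rare `ZWNJ` class the same weight as the frequent `NONE` class.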
Anthology ID:
2020.coling-main.406
Volume:
Proceedings of the 28th International Conference on Computational Linguistics
Month:
December
Year:
2020
Address:
Barcelona, Spain (Online)
Editors:
Donia Scott, Nuria Bel, Chengqing Zong
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
4612–4618
URL:
https://preview.aclanthology.org/icon-24-ingestion/2020.coling-main.406/
DOI:
10.18653/v1/2020.coling-main.406
Cite (ACL):
Ehsan Doostmohammadi, Minoo Nassajian, and Adel Rahimi. 2020. Joint Persian Word Segmentation Correction and Zero-Width Non-Joiner Recognition Using BERT. In Proceedings of the 28th International Conference on Computational Linguistics, pages 4612–4618, Barcelona, Spain (Online). International Committee on Computational Linguistics.
Cite (Informal):
Joint Persian Word Segmentation Correction and Zero-Width Non-Joiner Recognition Using BERT (Doostmohammadi et al., COLING 2020)
PDF:
https://preview.aclanthology.org/icon-24-ingestion/2020.coling-main.406.pdf