Abstract
This paper presents a novel approach to the segmentation of orthographic word forms in contemporary Hebrew, focusing purely on splitting without carrying out morphological analysis or disambiguation. Casting the analysis task as character-wise binary classification and using adjacent character and word-based lexicon-lookup features, this approach achieves over 98% accuracy on the benchmark SPMRL shared task data for Hebrew, and 97% accuracy on a new out-of-domain Wikipedia dataset, improvements of approximately 4% and 5%, respectively, over previous state-of-the-art performance.
- Anthology ID:
- W18-5811
- Volume:
- Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology
- Month:
- October
- Year:
- 2018
- Address:
- Brussels, Belgium
- Venue:
- EMNLP
- SIG:
- SIGMORPHON
- Publisher:
- Association for Computational Linguistics
- Pages:
- 101–110
- URL:
- https://aclanthology.org/W18-5811
- DOI:
- 10.18653/v1/W18-5811
- Cite (ACL):
- Amir Zeldes. 2018. A Characterwise Windowed Approach to Hebrew Morphological Segmentation. In Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 101–110, Brussels, Belgium. Association for Computational Linguistics.
- Cite (Informal):
- A Characterwise Windowed Approach to Hebrew Morphological Segmentation (Zeldes, EMNLP 2018)
- PDF:
- https://preview.aclanthology.org/ingestion-script-update/W18-5811.pdf
- Code:
- amir-zeldes/RFTokenizer
- Data:
- Wiki5K Hebrew segmentation, SPMRL Hebrew segmentation data
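
The character-wise binary classification framing described in the abstract can be illustrated with a minimal sketch. It is not the RFTokenizer implementation: the function names, the window width, the feature names, and the toy lexicon are all assumptions made for illustration, and the example word is a transliteration (`wbbyt`, segmented as `w|b|byt`) rather than real corpus data. Each character receives a label of 1 if a segment boundary follows it and 0 otherwise, and each character is described by its neighbouring characters plus simple lexicon lookups.

```python
# Illustrative sketch only; names and features are hypothetical,
# not taken from the paper or from the RFTokenizer code base.

def boundary_labels(segments):
    """Return one 0/1 label per character: 1 if a segment boundary follows it."""
    labels = []
    last_segment = len(segments) - 1
    for seg_index, segment in enumerate(segments):
        for char_index in range(len(segment)):
            ends_segment = char_index == len(segment) - 1
            # The final character of the word never opens a new segment.
            labels.append(1 if ends_segment and seg_index != last_segment else 0)
    return labels


def window_features(word, position, lexicon, width=2):
    """Assumed feature set for one character: a character window plus lexicon lookups."""
    padded = "_" * width + word + "_" * width
    center = position + width
    features = {
        f"char[{offset}]": padded[center + offset]
        for offset in range(-width, width + 1)
    }
    # Would a split after this character leave known strings on either side?
    features["prefix_in_lexicon"] = word[: position + 1] in lexicon
    features["suffix_in_lexicon"] = word[position + 1:] in lexicon
    return features


def apply_labels(word, labels):
    """Rebuild the segments from per-character boundary labels."""
    segments, current = [], ""
    for char, label in zip(word, labels):
        current += char
        if label == 1:
            segments.append(current)
            current = ""
    if current:
        segments.append(current)
    return segments


if __name__ == "__main__":
    toy_lexicon = {"wb", "byt"}  # hypothetical lexicon entries
    labels = boundary_labels(["w", "b", "byt"])
    print(labels)
    print(window_features("wbbyt", 1, toy_lexicon))
    print(apply_labels("wbbyt", labels))
```

In a full system the feature dictionaries would be fed to a binary classifier that predicts the label for each character; the sketch stops at the data representation, which is the part the abstract specifies.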