Abstract
State-of-the-art Natural Language Processing algorithms rely heavily on efficient word segmentation. Urdu is among the languages for which word segmentation is a complex task, as it exhibits both space omission and space insertion issues. This is partly due to the Arabic script, which, although cursive in nature, consists of characters that have inherent joining and non-joining attributes regardless of word boundary. This paper presents a word segmentation system for Urdu that uses a Conditional Random Field sequence modeler with orthographic, linguistic and morphological features. Our proposed model automatically learns to predict white space as a word boundary and the Zero Width Non-Joiner (ZWNJ) as a sub-word boundary. Using a manually annotated corpus, our model achieves an F1 score of 0.97 for word boundary identification and 0.85 for sub-word boundary identification. We have made our code and corpus publicly available to make our results reproducible.

- Anthology ID: C18-1217
- Volume: Proceedings of the 27th International Conference on Computational Linguistics
- Month: August
- Year: 2018
- Address: Santa Fe, New Mexico, USA
- Editors: Emily M. Bender, Leon Derczynski, Pierre Isabelle
- Venue: COLING
- Publisher: Association for Computational Linguistics
- Pages: 2562–2569
- URL: https://aclanthology.org/C18-1217
- Cite (ACL): Haris Bin Zia, Agha Ali Raza, and Awais Athar. 2018. Urdu Word Segmentation using Conditional Random Fields (CRFs). In Proceedings of the 27th International Conference on Computational Linguistics, pages 2562–2569, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
- Cite (Informal): Urdu Word Segmentation using Conditional Random Fields (CRFs) (Bin Zia et al., COLING 2018)
- PDF: https://preview.aclanthology.org/ingest-bitext-workshop/C18-1217.pdf
- Code: harisbinzia/Urdu-Word-Segmentation
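
The abstract frames segmentation as sequence labeling in which the model predicts white space as a word boundary and ZWNJ as a sub-word boundary. The sketch below illustrates one way such a task could be set up at the character level. It is not taken from the paper or its repository: the label names (`O`, `W`, `S`), the helper functions, the feature names and the non-joiner set are all assumptions made for illustration, and the per-label F1 may differ from how the paper scores its results. The feature dictionaries are in the shape that common CRF toolkits accept, but no CRF is trained here.

```python
# Illustrative sketch only: labels, helpers and features are hypothetical,
# not the authors' implementation.

SPACE = " "
ZWNJ = "\u200c"  # Zero Width Non-Joiner

# Assumed, non-exhaustive set of letters that do not join to the following
# letter: alef, alef madda, dal, ddal, thal, reh, rreh, zain, jeh, waw,
# yeh barree.
NON_JOINERS = set(
    "\u0627\u0622\u062f\u0688\u0630\u0631\u0691\u0632\u0698\u0648\u06d2"
)


def to_labels(segmented):
    """Strip boundary marks and label each character by what follows it.

    "W" = followed by a space, "S" = followed by ZWNJ, "O" = neither.
    """
    chars, labels = [], []
    for ch in segmented:
        if ch == SPACE:
            if labels:
                labels[-1] = "W"
        elif ch == ZWNJ:
            if labels and labels[-1] == "O":
                labels[-1] = "S"
        else:
            chars.append(ch)
            labels.append("O")
    return chars, labels


def from_labels(chars, labels):
    """Rebuild segmented text from characters and boundary labels."""
    out = []
    for ch, label in zip(chars, labels):
        out.append(ch)
        if label == "W":
            out.append(SPACE)
        elif label == "S":
            out.append(ZWNJ)
    return "".join(out)


def char_features(chars, i, window=2):
    """Orthographic features for position i: the character, its neighbours
    within the window, and whether it is in the assumed non-joiner set."""
    feats = {"char": chars[i], "non_joiner": chars[i] in NON_JOINERS}
    for off in range(-window, window + 1):
        if off == 0:
            continue
        j = i + off
        feats[f"char[{off:+d}]"] = chars[j] if 0 <= j < len(chars) else "<pad>"
    return feats


def label_f1(gold, pred, label):
    """F1 for a single label over aligned gold and predicted sequences."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Latin placeholders stand in for Urdu letters to keep the demo readable.
    chars, labels = to_labels("ab cd" + ZWNJ + "ef")
    features = [char_features(chars, i) for i in range(len(chars))]
    print(labels)
    print(features[0])
```

Labeling each character by what follows it keeps the input free of the very marks the model is meant to predict, so unsegmented text can be passed through `char_features` directly and the predicted labels turned back into text with `from_labels`.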