Abstract
Many NLP applications need fundamental tools to convert the input text into appropriate form or format and extract the primary linguistic knowledge of words and sentences. These tools perform segmentation of text into sentences, words and phrases, checking and correcting the spellings, doing lexical and morphological analysis, POS tagging and so on. Persian is among languages with complex preprocessing tasks. Having different writing prescriptions, spacings between or within words, character codings and spellings are some of the difficulties and challenges in converting various texts into a standard one. The lack of fundamental text processing tools such as morphological analyser (especially for derivational morphology) and POS tagger is another problem in Persian text processing. This paper introduces a set of fundamental tools for Persian text processing in STeP-1 package. STeP-1 (Standard Text Preparation for Persian language) performs a combination of tokenization, spell checking, morphological analysis and POS tagging. It also turns all Persian texts with different prescribed forms of writing to a series of tokens in the standard style introduced by Academy of Persian Language and Literature (APLL). Experimental results show high performance.- Anthology ID:
- L10-1557
- Volume:
- Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
- Month:
- May
- Year:
- 2010
- Address:
- Valletta, Malta
- Editors:
- Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, Daniel Tapias
- Venue:
- LREC
- SIG:
- Publisher:
- European Language Resources Association (ELRA)
- Note:
- Pages:
- Language:
- URL:
- http://www.lrec-conf.org/proceedings/lrec2010/pdf/809_Paper.pdf
- DOI:
- Cite (ACL):
- Mehrnoush Shamsfard, Hoda Sadat Jafari, and Mahdi Ilbeygi. 2010. STeP-1: A Set of Fundamental Tools for Persian Text Processing. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta. European Language Resources Association (ELRA).
- Cite (Informal):
- STeP-1: A Set of Fundamental Tools for Persian Text Processing (Shamsfard et al., LREC 2010)
- PDF:
- http://www.lrec-conf.org/proceedings/lrec2010/pdf/809_Paper.pdf