An Open, Extendible, and Fast Turkish Morphological Analyzer

Olcay Taner Yıldız, Begüm Avar, Gökhan Ercan


Abstract
In this paper, we present a two-level morphological analyzer for Turkish. The analyzer consists of five main components: a finite state transducer, a rule engine for suffixation, a lexicon, a trie data structure, and an LRU cache. We implement the finite state machine logic and the rule engine in Java and describe the finite state transducer rules of Turkish in XML, which makes the analyzer both easily extendible and easily applicable to other languages. With a comprehensive lexicon of 54,000 bare-forms, including 19,000 proper nouns, it is one of the most reliable analyzers produced so far. We compare the analyzer with Turkish morphological analyzers in the literature. By using an LRU cache and a trie data structure, the system can analyze 100,000 words per second, which enables users to analyze huge corpora in a few hours.
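
The abstract credits the trie and the LRU cache for the analyzer's speed but does not say how the two are combined. The Java sketch below shows one plausible arrangement, offered as an illustration under stated assumptions and not as the authors' implementation: a trie finds every lexicon entry that is a prefix of a surface form (candidate bare-forms), and an LRU cache keeps the result for recently seen words. All names here (AnalyzerSketch, LruCache, Trie, candidateRoots, trieLookups) are hypothetical, and the transducer and suffixation rules are left out entirely.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names come from the paper.
public class AnalyzerSketch {

    /** LRU cache: an access-ordered LinkedHashMap that drops its eldest entry when full. */
    static final class LruCache<K, V> extends LinkedHashMap<K, V> {
        private final int capacity;

        LruCache(int capacity) {
            super(16, 0.75f, true); // true = order entries by access, least recent first
            this.capacity = capacity;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > capacity;
        }
    }

    /** Character trie over lexicon entries. */
    static final class Trie {
        private static final class Node {
            final Map<Character, Node> children = new HashMap<>();
            boolean isWord;
        }

        private final Node root = new Node();

        void add(String word) {
            Node node = root;
            for (char c : word.toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new Node());
            }
            node.isWord = true;
        }

        /** Returns every lexicon entry that is a prefix of the surface form, shortest first. */
        List<String> prefixesOf(String surface) {
            List<String> found = new ArrayList<>();
            Node node = root;
            for (int i = 0; i < surface.length(); i++) {
                node = node.children.get(surface.charAt(i));
                if (node == null) {
                    break; // no lexicon entry continues with this character
                }
                if (node.isWord) {
                    found.add(surface.substring(0, i + 1));
                }
            }
            return found;
        }
    }

    private final Trie lexicon = new Trie();
    private final LruCache<String, List<String>> cache;
    int trieLookups = 0; // counts cache misses, for illustration only

    AnalyzerSketch(int cacheSize, String... bareForms) {
        this.cache = new LruCache<>(cacheSize);
        for (String form : bareForms) {
            lexicon.add(form);
        }
    }

    /** Candidate bare-forms for a surface form, served from the cache when possible. */
    List<String> candidateRoots(String surface) {
        List<String> cached = cache.get(surface);
        if (cached != null) {
            return cached;
        }
        trieLookups++;
        List<String> roots = lexicon.prefixesOf(surface);
        cache.put(surface, roots);
        return roots;
    }

    public static void main(String[] args) {
        AnalyzerSketch analyzer = new AnalyzerSketch(2, "ev", "evlen", "kitap");
        System.out.println(analyzer.candidateRoots("evlendi")); // prints [ev, evlen]
    }
}
```

In a full analyzer each candidate root would then be passed to the transducer, which decides whether the remaining characters form a valid suffix sequence. The cache pays off on running text because frequent word forms repeat often.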
Anthology ID:
R19-1156
Volume:
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)
Month:
September
Year:
2019
Address:
Varna, Bulgaria
Editors:
Ruslan Mitkov, Galia Angelova
Venue:
RANLP
Publisher:
INCOMA Ltd.
Pages:
1364–1372
URL:
https://aclanthology.org/R19-1156
DOI:
10.26615/978-954-452-056-4_156
Cite (ACL):
Olcay Taner Yıldız, Begüm Avar, and Gökhan Ercan. 2019. An Open, Extendible, and Fast Turkish Morphological Analyzer. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 1364–1372, Varna, Bulgaria. INCOMA Ltd.
Cite (Informal):
An Open, Extendible, and Fast Turkish Morphological Analyzer (Yıldız et al., RANLP 2019)
PDF:
https://preview.aclanthology.org/nschneid-patch-5/R19-1156.pdf