Linguistic Analysis Processing Line for Bulgarian
Aleksandar Savkov, Laska Laskova, Stanislava Kancheva, Petya Osenova, Kiril Simov
Abstract
This paper presents a linguistic processing pipeline for Bulgarian that covers morphological analysis, lemmatization and syntactic analysis of Bulgarian texts. The morphological analysis is performed by three modules: two statistical and one rule-based. The combination of these modules achieves the best result for morphological tagging of Bulgarian over a rich tagset (680 tags). The lemmatization is based on rules generated from a large morphological lexicon of Bulgarian. The syntactic analysis is implemented via MaltParser. The two statistical morphological taggers and MaltParser are trained on datasets constructed within the BulTreeBank project. The processing pipeline also includes a sentence splitter and a tokenizer. All tools in the pipeline are packaged as modules that can also be run separately. The whole pipeline is designed to serve as the back end of a web-service-oriented interface, but it also supports user tasks through a command-line interface. The processing pipeline is compatible with the Text Corpus Format, which allows it to delegate the management of the components to the WebLicht platform.
- Anthology ID: L12-1494
- Volume: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
- Month: May
- Year: 2012
- Address: Istanbul, Turkey
- Editors: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
- Venue: LREC
- Publisher: European Language Resources Association (ELRA)
- Pages: 2959–2964
- URL: http://www.lrec-conf.org/proceedings/lrec2012/pdf/829_Paper.pdf
- Cite (ACL): Aleksandar Savkov, Laska Laskova, Stanislava Kancheva, Petya Osenova, and Kiril Simov. 2012. Linguistic Analysis Processing Line for Bulgarian. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 2959–2964, Istanbul, Turkey. European Language Resources Association (ELRA).
- Cite (Informal): Linguistic Analysis Processing Line for Bulgarian (Savkov et al., LREC 2012)
- PDF: http://www.lrec-conf.org/proceedings/lrec2012/pdf/829_Paper.pdf