Sia Kolkovska

Also published as: Siya Kolkovska

2010

pdf abs
Bulgarian National Corpus Project
Svetla Koeva | Diana Blagoeva | Siya Kolkovska
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The paper presents Bulgarian National Corpus project (BulNC) - a large-scale, representative, online available corpus of Bulgarian. The BulNC is also a monolingual general corpus, fully morpho-syntactically (and partially semantically) annotated, and manually provided with detailed meta-data descriptions. Presently the Bulgarian National corpus consists of about 320 000 000 graphical words and includes more than 10 000 samples. Briefly the corpus structure and the accepted criteria for representativeness and well-balancing are presented. The query language for advance search of collocations and concordances is demonstrated with some examples - it allows to retrieve word combinations, ordered queries, inflexionally and semantically related words, part-of-speech tags, utilising Boolean operations and grouping as well. The BulNC already plays a significant role in natural language processing of Bulgarian contributing to scientific advances in spelling and grammar checking, word sense disambiguation, speech recognition, text categorisation, topic extraction and machine translation. The BulNC can also be used in different investigations going beyond the linguistics: library studies, social sciences research, teaching methods studies, etc.

2004

pdf abs
A Language Resources Infrastructure for Bulgarian
Kiril Simov | Petya Osenova | Sia Kolkovska | Elisaveta Balabanova | Dimitar Doikoff
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

This paper describes the infrastructure of a basic language resources set for Bulgarian in the context of BLARK initiative requirements. We focus on the treebanking task as a trigger for basic language resources compilation. Two strategies have been applied in this respect: (1) implementing the main pre-processing modules before the treebank compilation and (2) creating more elaborate types of resources in parallel to the treebank compilation. The description of language resources within BulTreeBank project is divided into two parts: language technology, which includes tokenization, morphosyntactic analyzer, morphosyntactic disambiguation, partial grammars, and language data, which includes the layers of the BulTreeBank corpus and the variety of lexicons. The advantages of our approach to a less-spoken language (like Bulgarian) are as follows: it triggers the creation of the basic set of language resources which lack for certain languages and it rises the question about the ways of language resources creation.

2002

Co-authors

Venues

lrec3