A Language Resources Infrastructure for Bulgarian

Kiril Simov, Petya Osenova, Sia Kolkovska, Elisaveta Balabanova, Dimitar Doikoff


Abstract
This paper describes the infrastructure of a basic language resources set for Bulgarian in the context of BLARK initiative requirements. We focus on the treebanking task as a trigger for basic language resources compilation. Two strategies have been applied in this respect: (1) implementing the main pre-processing modules before the treebank compilation and (2) creating more elaborate types of resources in parallel to the treebank compilation. The description of language resources within BulTreeBank project is divided into two parts: language technology, which includes tokenization, morphosyntactic analyzer, morphosyntactic disambiguation, partial grammars, and language data, which includes the layers of the BulTreeBank corpus and the variety of lexicons. The advantages of our approach to a less-spoken language (like Bulgarian) are as follows: it triggers the creation of the basic set of language resources which lack for certain languages and it rises the question about the ways of language resources creation.
Anthology ID:
L04-1171
Volume:
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
Month:
May
Year:
2004
Address:
Lisbon, Portugal
Venue:
LREC
SIG:
Publisher:
European Language Resources Association (ELRA)
Note:
Pages:
Language:
URL:
http://www.lrec-conf.org/proceedings/lrec2004/pdf/316.pdf
DOI:
Bibkey:
Cite (ACL):
Kiril Simov, Petya Osenova, Sia Kolkovska, Elisaveta Balabanova, and Dimitar Doikoff. 2004. A Language Resources Infrastructure for Bulgarian. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04), Lisbon, Portugal. European Language Resources Association (ELRA).
Cite (Informal):
A Language Resources Infrastructure for Bulgarian (Simov et al., LREC 2004)
Copy Citation:
PDF:
http://www.lrec-conf.org/proceedings/lrec2004/pdf/316.pdf