Kiril Simov

Also published as: Kiril Iv. Simov

2024

Bulgarian ParlaMint 4.0 corpus as a testset for Part-of-speech tagging and Named Entity Recognition
Petya Osenova | Kiril Simov
Proceedings of the IV Workshop on Creating, Analysing, and Increasing Accessibility of Parliamentary Corpora (ParlaCLARIN) @ LREC-COLING 2024

The paper discusses some fine-tuned models for the tasks of part-of-speech tagging and named entity recognition. The fine-tuning was performed on the basis of an existing BERT pre-trained model and two newly pre-trained BERT models for Bulgarian that are cross-tested on the domain of the Bulgarian part of the ParlaMint corpora as a new domain. In addition, a comparison has been made between the performance of the new fine-tuned BERT models and the available results from the Stanza-based model which the Bulgarian part of the ParlaMint corpora has been annotated with. The observations show the weaknesses in each model as well as the common challenges.

2023

Transformer-Based Language Models for Bulgarian
Iva Marinova | Kiril Simov | Petya Osenova
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

This paper presents an approach for training lightweight and robust language models for Bulgarian that mitigate gender, political, racial, and other biases in the data. Our method involves scraping content from major Bulgarian online media providers using a specialized procedure for source filtering, topic selection, and lexicon-based removal of inappropriate language during the pre-training phase. We continuously improve the models by incorporating new data from various domains, including social media, books, scientific literature, and linguistically modified corpora. Our motivation is to provide a solution that is sufficient for all natural language processing tasks in Bulgarian, and to address the lack of existing procedures for guaranteeing the robustness of such models.

Recent Developments in BTB-WordNet
Kiril Simov | Petya Osenova
Proceedings of the 12th Global Wordnet Conference

The paper reports on recent developments in Bulgarian BTB-WordNet (BTB-WN). This resource is viewed as playing a central role with respect to the integration and interlinking of various language resources such as: e-dictionaries (morphological, terminological, bilingual, orthographic, etymological and explanatory, etc., including editions from previous periods); corpora (coming from outside or being internal - like the corpus of definitions as well as the corpus of examples to synset meanings); ontologies (such as CIDOC-CRM, DBpedia, etc.); sources of world knowledge (such as information from the Bulgarian Encyclopedia, Wikipedia, etc.). The paper also gives information about a number of applications built on BTB-WN. These are: the Bulgaria-centered knowledge graph, the All about word application as well as some education-oriented exercises.

We present bgGLUE (Bulgarian General Language Understanding Evaluation), a benchmark for evaluating language models on Natural Language Understanding (NLU) tasks in Bulgarian. Our benchmark includes NLU tasks targeting a variety of NLP problems (e.g., natural language inference, fact-checking, named entity recognition, sentiment analysis, question answering, etc.) and machine learning tasks (sequence labeling, document-level classification, and regression). We run the first systematic evaluation of pre-trained language models for Bulgarian, comparing and contrasting results across the nine tasks in the benchmark. The evaluation results show strong performance on sequence labeling tasks, but there is a lot of room for improvement for tasks that require more complex reasoning. We make bgGLUE publicly available together with the fine-tuning and the evaluation code, as well as a public leaderboard at https://bgglue.github.io, and we hope that it will enable further advancements in developing NLU models for Bulgarian.

2022

The Bulgarian Event Corpus: Overview and Initial NER Experiments
Petya Osenova | Kiril Simov | Iva Marinova | Melania Berbatova
Proceedings of the Thirteenth Language Resources and Evaluation Conference

The paper describes the Bulgarian Event Corpus (BEC). The annotation scheme is based on the CIDOC-CRM ontology and on the English FrameNet, adjusted for our task. It includes two main layers: named entities and events with their roles. The corpus is multi-domain and mainly oriented towards Social Sciences and Humanities (SSH). It will be used for: extracting knowledge and making it available through the Bulgaria-centric Knowledge Graph; further developing an annotation scheme that handles multiple domains in SSH; training automatic modules for the most important knowledge-based tasks, such as domain-specific and nested NER, NEL, event detection and profiling. Initial experiments were conducted on the standard NER task due to the complexity of the dataset and the rich NE annotation scheme. The results are promising with respect to some labels and give insights into how to handle the others better. These experiments also serve as error detection modules that would help us in redesigning the scheme. They are a basis for further and more complex tasks, such as nested NER, NEL and event detection.

2021

Monitoring Fact Preservation, Grammatical Consistency and Ethical Behavior of Abstractive Summarization Neural Models
Iva Marinova | Yolina Petrova | Milena Slavcheva | Petya Osenova | Ivaylo Radev | Kiril Simov
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

The paper describes a system for automatic summarization in English of online news data that come from various non-English languages. The system is designed to be used in a production environment for media monitoring. Automatic summarization can be very helpful in this domain when applied as a helper tool for journalists so that they can review just the important information from the news channels. However, like every software solution, automatic summarization needs performance monitoring and an assured safe environment for the clients. In a media monitoring environment the most problematic features to be addressed are: copyright issues, factual consistency, the style of the text and the ethical norms in journalism. Thus, the main contribution of our present work is that the above-mentioned characteristics are successfully monitored in neural automatic summarization models and improved with the help of validation, fact-preserving and fact-checking procedures.

2020

Aligning senses across resources and languages is a challenging task with beneficial applications in the field of natural language processing and electronic lexicography. In this paper, we describe our efforts in manually aligning monolingual dictionaries. The alignment is carried out at sense-level for various resources in 15 languages. Moreover, senses are annotated with possible semantic relationships such as broadness, narrowness, relatedness, and equivalence. In comparison to previous datasets for this task, this dataset covers a wide range of languages and resources and focuses on the more challenging task of linking general-purpose language. We believe that our data will pave the way for further advances in alignment and evaluation of word senses by creating new solutions, particularly those notoriously requiring data such as neural networks. Our resources are publicly available at https://github.com/elexis-eu/MWSA.

Reconstructing NER Corpora: a Case Study on Bulgarian
Iva Marinova | Laska Laskova | Petya Osenova | Kiril Simov | Alexander Popov
Proceedings of the Twelfth Language Resources and Evaluation Conference

The paper reports on the use of deep learning methods for improving a Named Entity Recognition (NER) training corpus and for predicting and annotating new types in a test corpus. We show how the annotations in a type-based corpus of named entities (NE) were populated as occurrences within it, thus ensuring density of the training information. A deep learning model was adopted for discovering inconsistencies in the initial annotation and for learning new NE types. The evaluation results improve after data curation, randomization and deduplication.

Implementing an End-to-End Treebank-Informed Pipeline for Bulgarian
Alexander Popov | Petya Osenova | Kiril Simov
Proceedings of the 19th International Workshop on Treebanks and Linguistic Theories

2019

Modeling MWEs in BTB-WN
Laska Laskova | Petya Osenova | Kiril Simov | Ivajlo Radev | Zara Kancheva
Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019)

The paper presents the characteristics of the predominant types of MultiWord expressions (MWEs) in the BulTreeBank WordNet – BTB-WN. Their distribution in BTB-WN is discussed with respect to the overall hierarchical organization of the lexical resource. Also, a catena-based modeling is proposed for handling the issues of lexical semantics of MWEs.

Towards transferring Bulgarian Sentences with Elliptical Elements to Universal Dependencies: issues and strategies
Petya Osenova | Kiril Simov
Proceedings of the Third Workshop on Universal Dependencies (UDW, SyntaxFest 2019)

Aligning the Bulgarian BTB WordNet with the Bulgarian Wikipedia
Kiril Simov | Petya Osenova | Laska Laskova | Ivajlo Radev | Zara Kancheva
Proceedings of the 10th Global Wordnet Conference

The paper reports on ongoing work that manually maps the Bulgarian WordNet BTB-WN to the Bulgarian Wikipedia. The preparatory work of extracting the Wikipedia articles and provisionally relating them to the WordNet lemmas was done automatically. The manual work includes checking the corresponding senses in both resources as well as the missing ones. The main cases of mapping are considered. The first experiments of mapping about 1000 synsets show the establishment of more than 78% of exact correspondences and nearly 15% of new synsets.

Know Your Graph. State-of-the-Art Knowledge-Based WSD
Alexander Popov | Kiril Simov | Petya Osenova
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

This paper introduces several improvements over the current state of the art in knowledge-based word sense disambiguation. Those innovations are the result of modifying and enriching a knowledge base created originally on the basis of WordNet. They reflect several separate but connected strategies: manipulating the shape and the content of the knowledge base, assigning weights over the relations in the knowledge base, and the addition of new relations to it. The main contribution of the paper is to demonstrate that the previously proposed knowledge bases organize linguistic and world knowledge suboptimally for the task of word sense disambiguation. In doing so, the paper also establishes a new state of the art for knowledge-based approaches. Its best models are competitive in the broader context of supervised systems as well.

A Morpho-Syntactically Informed LSTM-CRF Model for Named Entity Recognition
Lilia Simeonova | Kiril Simov | Petya Osenova | Preslav Nakov
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

We propose a morphologically informed model for named entity recognition, which is based on LSTM-CRF architecture and combines word embeddings, Bi-LSTM character embeddings, part-of-speech (POS) tags, and morphological information. While previous work has focused on learning from raw word input, using word and character embeddings only, we show that for morphologically rich languages, such as Bulgarian, access to POS information contributes more to the performance gains than the detailed morphological information. Thus, we show that named entity recognition needs only coarse-grained POS tags, but at the same time it can benefit from simultaneously using some POS information of different granularity. Our evaluation results over a standard dataset show sizeable improvements over the state-of-the-art for Bulgarian NER.

2018

Grammatical Role Embeddings for Enhancements of Relation Density in the Princeton WordNet
Kiril Simov | Alexander Popov | Iliana Simova | Petya Osenova
Proceedings of the 9th Global Wordnet Conference

In this paper we present an approach for training verb subatom embeddings. For each verb we learn several embeddings rather than only one. These embeddings include the verb itself as well as embeddings for each grammatical role of this verb. To give an example, for the verb ‘to give’ we learn four embeddings: one for the lemma ‘give’, one for the subject, one for the direct object and one for the indirect object. We have exploited these grammatical role embeddings in order to add new syntagmatic relations to WordNet. The quality of the new relations has been evaluated extrinsically through the Knowledge-based Word Sense Disambiguation task.

2017

Recent Developments within BulTreeBank
Petya Osenova | Kiril Simov
Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories

Annotation of Clinical Narratives in Bulgarian language
Ivajlo Radev | Kiril Simov | Galia Angelova | Svetla Boytcheva
Proceedings of the Biomedical NLP Workshop associated with RANLP 2017

In this paper we describe the process of annotating clinical texts with morphosyntactic and semantic information. The corpus contains 1,300 discharge letters in Bulgarian for patients with endocrinology and metabolic disorders. The annotated corpus will be used as a gold standard for evaluating information extraction on a test corpus of 6,200 discharge letters. The annotation is performed within the CLaRK system, an XML-based system for corpora development. It provides a mechanism for semi-automatic annotation: first, a pipeline for Bulgarian morphosyntactic annotation and a cascaded regular grammar for semantic annotation are run; then rules for cleaning frequent errors are applied. Finally, the result is manually checked. We also hope to adapt the morphosyntactic tagger to the domain of clinical narratives.

Bulgarian-English and English-Bulgarian Machine Translation: System Design and Evaluation
Petya Osenova | Kiril Simov
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

The paper presents a deep factored machine translation (MT) system between English and Bulgarian in both directions. The MT system is hybrid. It consists of three main steps: (1) the source-language text is linguistically annotated, (2) it is translated to the target language with the Moses system, and (3) the translation is post-processed with the help of the linguistic annotation transferred from the source text. Besides automatic evaluation, we performed manual evaluation over a domain test suite of sentences demonstrating certain phenomena such as imperatives, questions, etc.

Towards Lexical Chains for Knowledge-Graph-based Word Embeddings
Kiril Simov | Svetla Boytcheva | Petya Osenova
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

Word vectors with varying dimensionalities and produced by different algorithms have been extensively used in NLP. The corpora that the algorithms are trained on can contain either natural language text (e.g. Wikipedia or newswire articles) or artificially-generated pseudo corpora due to natural data sparseness. We exploit Lexical Chain based templates over Knowledge Graph for generating pseudo-corpora with controlled linguistic value. These corpora are then used for learning word embeddings. A number of experiments have been conducted over the following test sets: WordSim353 Similarity, WordSim353 Relatedness and SimLex-999. The results show that, on the one hand, the incorporation of many-relation lexical chains improves results, but on the other hand, unrestricted-length chains remain difficult to handle with respect to their huge quantity.

2016

This work presents parallel corpora automatically annotated with several NLP tools, including lemma and part-of-speech tagging, named-entity recognition and classification, named-entity disambiguation, word-sense disambiguation, and coreference. The corpora comprise both the well-known Europarl corpus and a domain-specific question-answer troubleshooting corpus on the IT domain. English is common in all parallel corpora, with translations in five languages, namely, Basque, Bulgarian, Czech, Portuguese and Spanish. We describe the annotated corpora and the tools used for annotation, as well as annotation statistics for each language. These new resources are freely available and will help research on semantic processing for machine translation and cross-lingual transfer.

The Role of the WordNet Relations in the Knowledge-based Word Sense Disambiguation Task
Kiril Simov | Alexander Popov | Petya Osenova
Proceedings of the 8th Global WordNet Conference (GWC)

In this paper we present an analysis of different semantic relations extracted from WordNet, Extended WordNet and SemCor, with respect to their role in the task of knowledge-based word sense disambiguation. The experiments use the same algorithm and the same test sets, but different variants of the knowledge graph. The results show that different sets of relations have different impact on the results: positive or negative. The beneficial ones are discussed with respect to the combination of relations and with respect to the test set. The inclusion of inference has only a modest impact on accuracy, while the addition of syntactic relations produces stable improvement over the baselines.

Towards Semantic-based Hybrid Machine Translation between Bulgarian and English
Kiril Simov | Petya Osenova | Alexander Popov
Proceedings of the 2nd Workshop on Semantics-Driven Machine Translation (SedMT 2016)

A Hybrid Approach for Deep Machine Translation
Kiril Simov | Petya Osenova
Proceedings of the 2nd Deep Machine Translation Workshop

2015

Improving Word Sense Disambiguation with Linguistic Knowledge from a Sense Annotated Treebank
Kiril Simov | Alexander Popov | Petya Osenova
Proceedings of the International Conference Recent Advances in Natural Language Processing

Training Automatic Transliteration Models on DBPedia Data
Velislava Todorova | Kiril Simov
Proceedings of the International Conference Recent Advances in Natural Language Processing

Catena Operations for Unified Dependency Analysis
Kiril Simov | Petya Osenova
Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015)

Universalizing BulTreeBank: a Linguistic Tale about Glocalization
Petya Osenova | Kiril Simov
The 5th Workshop on Balto-Slavic Natural Language Processing

Proceedings of the Second Workshop on Natural Language Processing and Linked Open Data
Piek Vossen | German Rigau | Petya Osenova | Kiril Simov
Proceedings of the Second Workshop on Natural Language Processing and Linked Open Data

Accessing Linked Open Data via A Common Ontology
Kiril Simov | Atanas Kiryakov
Proceedings of the Second Workshop on Natural Language Processing and Linked Open Data

Factored models for Deep Machine Translation
Kiril Simov | Iliana Simova | Velislava Todorova | Petya Osenova
Proceedings of the 1st Deep Machine Translation Workshop

2014

A System for Experiments with Dependency Parsers
Kiril Simov | Iliana Simova | Ginka Ivanova | Maria Mateva | Petya Osenova
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper we present a system for experimenting with combinations of dependency parsers. The system supports initial training of different parsing models, creation of parsebank(s) with these models, and different strategies for the construction of ensemble models aimed at improving the output of the individual models by voting. The system employs two algorithms for construction of dependency trees from several parses of the same sentence and several ways of ranking the arcs in the resulting trees. We have performed experiments with state-of-the-art dependency parsers, including MaltParser, MSTParser, TurboParser, and MATEParser, on the data from the Bulgarian treebank, BulTreeBank. Our best result from these experiments is slightly better than the best result reported in the literature for this language.

Constituency Parsing of Bulgarian: Word- vs Class-based Parsing
Masood Ghayoomi | Kiril Simov | Petya Osenova
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper, we report the results obtained with two constituency parsers trained on BulTreeBank, an HPSG-based treebank for Bulgarian. To reduce the data sparsity problem, we propose using Brown word clustering to do an off-line clustering and map the words in the treebank to create a class-based treebank. The observations show that when the classes outnumber the POS tags, the results are better. Since this approach adds another dimension of abstraction (in comparison to the lemma), its coarse-grained representation can be used further for training statistical parsers.

Joint Ensemble Model for POS Tagging and Dependency Parsing
Iliana Simova | Dimitar Vasilev | Alexander Popov | Kiril Simov | Petya Osenova
Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages

2013

Combining POS Tagging, Dependency Parsing and Coreferential Resolution for Bulgarian
Valentin Zhikov | Georgi Georgiev | Kiril Simov | Petya Osenova
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

Invited Talk: Ontologies and Linked Open Data for Acquisition and Exploitation of Language Resources
Kiril Simov
Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing

Proceedings of the Joint Workshop on NLP&LOD and SWAIE: Semantic Web, Linked Open Data and Information Extraction
Diana Maynard | Marieke van Erp | Brian Davis | Petya Osenova | Kiril Simov | Georgi Georgiev | Preslav Nakov
Proceedings of the Joint Workshop on NLP&LOD and SWAIE: Semantic Web, Linked Open Data and Information Extraction

Towards a System for Dynamic Language Resources in LOD
Kiril Simov
Proceedings of the Joint Workshop on NLP&LOD and SWAIE: Semantic Web, Linked Open Data and Information Extraction

2012

Linguistically-Augmented Bulgarian-to-English Statistical Machine Translation Model
Rui Wang | Petya Osenova | Kiril Simov
Proceedings of the Joint Workshop on Exploiting Synergies between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra)

Linguistically-Enriched Models for Bulgarian-to-English Machine Translation
Rui Wang | Petya Osenova | Kiril Simov
Proceedings of the Sixth Workshop on Syntax, Semantics and Structure in Statistical Translation

A Treebank-driven Creation of an OntoValence Verb lexicon for Bulgarian
Petya Osenova | Kiril Simov | Laska Laskova | Stanislava Kancheva
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The paper presents a treebank-driven approach to the construction of a Bulgarian valence lexicon with ontological restrictions over the inner participants of the event. First, the underlying ideas behind the Bulgarian Ontology-based lexicon are outlined. Then, the extraction and manipulation of the valence frames is discussed with respect to the BulTreeBank annotation scheme and the DOLCE ontology. Also, the most frequent types of syntactic frames are specified as well as the most frequent types of ontological restrictions over the verb arguments. The envisaged applications of such a lexicon are assigning ontological labels to syntactically parsed corpora, and expanding the lexicon and lexical information in the Bulgarian Resource Grammar.

Linguistic Analysis Processing Line for Bulgarian
Aleksandar Savkov | Laska Laskova | Stanislava Kancheva | Petya Osenova | Kiril Simov
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper presents a linguistic processing pipeline for Bulgarian including morphological analysis, lemmatization and syntactic analysis of Bulgarian texts. The morphological analysis is performed by three modules: two statistical and one rule-based. The combination of these modules achieves the best result for morphological tagging of Bulgarian over a rich tagset (680 tags). The lemmatization is based on rules generated from a large morphological lexicon of Bulgarian. The syntactic analysis is implemented via MaltParser. The two statistical morphological taggers and MaltParser are trained on datasets constructed within the BulTreeBank project. The processing pipeline also includes a sentence splitter and a tokenizer. All tools in the pipeline are packed in modules that can also run separately. The whole pipeline is designed to serve as a back-end of a web-service-oriented interface, but it also supports user tasks through a command-line interface. The processing pipeline is compatible with the Text Corpus Format, which allows it to delegate the management of the components to the WebLicht platform.

The Political Speech Corpus of Bulgarian
Petya Osenova | Kiril Simov
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The paper introduces the Political Speech Corpus of Bulgarian. First, its current state is discussed with respect to its size, coverage, genre specification and related online services. Then, the focus moves to the annotation details. On the one hand, the layers of linguistic annotation are presented. On the other hand, the compatibility with the CLARIN technical infrastructure is explained. Also, some user-based scenarios are mentioned to demonstrate the corpus services and applicability.

Feature-Rich Part-of-speech Tagging for Morphologically Complex Languages: Application to Bulgarian
Georgi Georgiev | Valentin Zhikov | Kiril Simov | Petya Osenova | Preslav Nakov
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

2011

Towards Minimal Recursion Semantics over Bulgarian Dependency Parsing
Kiril Simov | Petya Osenova
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

Language Technology Support for Semantic Annotation of Icono-graphic Descriptions
Kamenka Staykova | Gennady Agre | Kiril Simov | Petya Osenova
Proceedings of the Workshop on Language Technologies for Digital Humanities and Cultural Heritage

Proceedings of the Second Workshop on Annotation and Exploitation of Parallel Corpora
Kiril Simov | Petya Osenova | Jörg Tiedemann | Radovan Garabik
Proceedings of the Second Workshop on Annotation and Exploitation of Parallel Corpora

Bulgarian-English Parallel Treebank: Word and Semantic Level Alignment
Kiril Simov | Petya Osenova | Laska Laskova | Aleksandar Savkov | Stanislava Kancheva
Proceedings of the Second Workshop on Annotation and Exploitation of Parallel Corpora

2010

Exploring Co-Reference Chains for Concept Annotation of Domain Texts
Petya Osenova | Laska Laskova | Kiril Simov
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The paper explores co-reference chains as a way of improving the density of concept annotation over domain texts. The idea extends the authors' previous work on relating the ontology to the text terms in two domains, IT and textile. Here the IT domain is used. The challenge is to enhance relations among concepts instead of text entities, the latter being pursued in most works. Our ultimate goal is to exploit these additional chains for concept disambiguation as well as sparseness resolution at concept level. First, a gold standard was prepared with manually connected links among concepts, anaphoric pronouns and contextual equivalents. This step was necessary not only for test purposes, but also for better orientation in the co-referent types and distribution. Then, two automatic systems were tested on the gold standard. Note that these systems were not designed specifically for concept chaining. The conclusion is that state-of-the-art co-reference resolution systems might address the concept sparseness problem, but not so much the concept disambiguation task. For the latter, word-sense disambiguation systems have to be integrated.

Constructing of an Ontology-based Lexicon for Bulgarian
Kiril Simov | Petya Osenova
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this paper we report on the progress in the creation of an Ontology-based lexicon for Bulgarian. We started with the concept set from an upper ontology (DOLCE). Then it was extended with concepts selected from the OntoWordNet, which correspond to Core WordNet and EuroWordNet Basic concepts. The underlying idea behind the ontology-based lexicon is its organization via two semantic relations: equivalence and subsumption. These relations reflect the distribution of lexical unit senses with respect to the concepts in the ontology. The lexical unit candidates for concept mapping have been selected from two large and well-developed lexical resources for Bulgarian: a machine-readable explanatory dictionary and a morphological lexicon. In the initial step, we handled the lexical units whose senses are equivalent to concepts in the ontology (2500 at the moment). Then, in the second stage, we are proceeding with lexical units selected according to their frequency distribution in a large Bulgarian corpus. This step is the more challenging one, since it might also require additions of concepts to the ontology. The main applications of the lexicon are envisaged to be semantic annotation and semantic IR for Bulgarian.

2009

A Web-Enabled and Speech-Enhanced Parallel Corpus of Greek-Bulgarian Cultural Texts
Voula Giouli | Nikos Glaros | Kiril Simov | Petya Osenova
Proceedings of the EACL 2009 Workshop on Language Technology and Resources for Cultural Heritage, Social Sciences, Humanities, and Education (LaTeCH – SHELT&R 2009)

Proceedings of the Workshop on Adaptation of Language Resources and Technology to New Domains
Núria Bel | Erhard Hinrichs | Petya Osenova | Kiril Simov
Proceedings of the Workshop on Adaptation of Language Resources and Technology to New Domains

Cross-lingual Adaptation as a Baseline: Adapting Maximum Entropy Models to Bulgarian
Georgi Georgiev | Preslav Nakov | Petya Osenova | Kiril Simov
Proceedings of the Workshop on Adaptation of Language Resources and Technology to New Domains

Feature-Rich Named Entity Recognition for Bulgarian Using Conditional Random Fields
Georgi Georgiev | Preslav Nakov | Kuzman Ganchev | Petya Osenova | Kiril Simov
Proceedings of the International Conference RANLP-2009

2008

Language Resources for Semantic Document Annotation and Crosslingual Retrieval
Petya Osenova | Kiril Simov | Eelco Mossel
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper describes the interaction among language resources for an adequate concept annotation of domain texts in several languages. The architecture includes a domain ontology, domain texts, language-specific lexicons, regular grammars and disambiguation rules. The ontology plays a central role in the architecture. We assume that it represents the meaning of the terms in the lexicons. Thus, the lexicons for the languages of the project (http://www.lt4el.eu/ - the LT4eL (Language Technology for eLearning) project is supported by the European Community under the Information Society and Media Directorate, Learning and Cultural Heritage Unit.) are constructed on the basis of the ontology. The grammars and disambiguation rules facilitate the annotation of the text with concepts from the ontology. The relation established in this way between ontology and text supports different searches for content in the annotated documents. This is considered the preparatory phase for the integration of a semantic search facility in Learning Management Systems. The implementation and performance of this search are discussed in the context of related work as well as other types of searches. The results from some preliminary steps towards evaluation of the concept-based and text-based search are also presented.

2007

2006

pdf abs
Shallow Semantic Annotation of Bulgarian
Kiril Simov | Petya Osenova
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

The paper discusses shallow semantic annotation of a Bulgarian treebank. Our goal is to construct the next layer of linguistic interpretation over the morphological and syntactic layers that have already been encoded in the treebank. The annotation is called shallow because it encodes only the senses of the non-functional words and the relations between the semantic indices connected to them. We do not encode quantifiers and scope information. An ontology is employed as a stock of the concepts and relations that form the word senses. Our lexicon is based on the Generative Lexicon (GL) model (Pustejovsky 1995) as it was implemented in the SIMPLE project (Lenci et al. 2000). GL defines the way in which the words are connected to the concepts and the relations in the ontology. It also provides mechanisms for literal sense changes such as type coercion, metonymy, and the like. Some of these phenomena are presented in the annotation.

2004

pdf
The CLaRK System: XML-based Corpora Development System for Rapid Prototyping
Kiril Simov | Alexander Simov | Hristo Ganev | Krasimira Ivanova | Ilko Grigorov
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

pdf abs
A Language Resources Infrastructure for Bulgarian
Kiril Simov | Petya Osenova | Sia Kolkovska | Elisaveta Balabanova | Dimitar Doikoff
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

This paper describes the infrastructure of a basic language resources set for Bulgarian in the context of the BLARK initiative requirements. We focus on the treebanking task as a trigger for basic language resources compilation. Two strategies have been applied in this respect: (1) implementing the main pre-processing modules before the treebank compilation and (2) creating more elaborate types of resources in parallel to the treebank compilation. The description of language resources within the BulTreeBank project is divided into two parts: language technology, which includes tokenization, a morphosyntactic analyzer, morphosyntactic disambiguation, and partial grammars; and language data, which includes the layers of the BulTreeBank corpus and the variety of lexicons. The advantages of our approach to a less-spoken language (like Bulgarian) are as follows: it triggers the creation of the basic set of language resources, which are lacking for certain languages, and it raises the question of how language resources are created.

pdf
Unexpected Productions May Well be Errors
Tylman Ule | Kiril Simov
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

pdf
Making Monolingual Corpora Comparable: a Case Study of Bulgarian and Croatian
Božo Bekavac | Petya Osenova | Kiril Simov | Marko Tadić
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

pdf
A Hybrid Strategy For Regular Grammar Parsing
Kiril Simov | Petya Osenova
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)