Petya Osenova

2024

We present ongoing work towards defining a lexicon-corpus interface to serve as a benchmark in the representation of multiword expressions (of various parts of speech) in dedicated lexica and the linking of these entries to their corpus occurrences. The final aim is the harnessing of such resources for the automatic identification of multiword expressions in a text. The involvement of several natural languages aims at the universality of a solution not centered on a particular language, and also accommodating idiosyncrasies. Challenges in the lexicographic description of multiword expressions are discussed, the current status of lexica dedicated to this linguistic phenomenon is outlined, as well as the solution we envisage for creating an ecosystem of interlinked lexica and corpora containing and, respectively, annotated with multiword expressions.

pdf abs
Bulgarian ParlaMint 4.0 corpus as a testset for Part-of-speech tagging and Named Entity Recognition
Petya Osenova | Kiril Simov
Proceedings of the IV Workshop on Creating, Analysing, and Increasing Accessibility of Parliamentary Corpora (ParlaCLARIN) @ LREC-COLING 2024

The paper discusses some fine-tuned models for the tasks of part-of-speech tagging and named entity recognition. The fine-tuning was performed on the basis of an existing BERT pre-trained model and two newly pre-trained BERT models for Bulgarian that are cross-tested on the domain of the Bulgarian part of the ParlaMint corpora as a new domain. In addition, a comparison has been made between the performance of the new fine-tuned BERT models and the available results from the Stanza-based model which the Bulgarian part of the ParlaMint corpora has been annotated with. The observations show the weaknesses in each model as well as the common challenges.

2023

pdf abs
Recent Developments in BTB-WordNet
Kiril Simov | Petya Osenova
Proceedings of the 12th Global Wordnet Conference

The paper reports on recent developments in Bulgarian BTB-WordNet (BTB-WN). This resource is viewed as playing a central role with respect to the integration and interlinking of various language resources such as: e-dictionaries (morphological, terminological, bilingual, orthographic, etymological and explanatory, etc., including editions from previous periods); corpora (coming from outside or being internal - like the corpus of definitions as well as the corpus of examples to synset meanings); ontologies (such as CIDOC-CRM, DBpedia, etc.); sources of world knowledge (such as information from the Bulgarian Encyclopedia, Wikipedia, etc.). The paper also gives information about a number of applications built on BTB-WN. These are: the Bulgaria-centered knowledge graph, the All about word application as well as some education-oriented exercises.

We present bgGLUE (Bulgarian General Language Understanding Evaluation), a benchmark for evaluating language models on Natural Language Understanding (NLU) tasks in Bulgarian. Our benchmark includes NLU tasks targeting a variety of NLP problems (e.g., natural language inference, fact-checking, named entity recognition, sentiment analysis, question answering, etc.) and machine learning tasks (sequence labeling, document-level classification, and regression). We run the first systematic evaluation of pre-trained language models for Bulgarian, comparing and contrasting results across the nine tasks in the benchmark. The evaluation results show strong performance on sequence labeling tasks, but there is a lot of room for improvement for tasks that require more complex reasoning. We make bgGLUE publicly available together with the fine-tuning and the evaluation code, as well as a public leaderboard at https://bgglue.github.io, and we hope that it will enable further advancements in developing NLU models for Bulgarian.

pdf abs
Transformer-Based Language Models for Bulgarian
Iva Marinova | Kiril Simov | Petya Osenova
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

This paper presents an approach for training lightweight and robust language models for Bulgarian that mitigate gender, political, racial, and other biases in the data. Our method involves scraping content from major Bulgarian online media providers using a specialized procedure for source filtering, topic selection, and lexicon-based removal of inappropriate language during the pre-training phase. We continuously improve the models by incorporating new data from various domains, including social media, books, scientific literature, and linguistically modified corpora. Our motivation is to provide a solution that is sufficient for all natural language processing tasks in Bulgarian, and to address the lack of existing procedures for guaranteeing the robustness of such models.

2022

In ParlaMint I, a CLARIN-ERIC supported project in pandemic times, a set of comparable and uniformly annotated multilingual corpora for 17 national parliaments were developed and released in 2021. For 2022 and 2023, the project has been extended to ParlaMint II, again with the CLARIN ERIC financial support, in order to enhance the existing corpora with new data and metadata; upgrade the XML schema; add corpora for 10 new parliaments; provide more application scenarios and carry out additional experiments. The paper reports on these planned steps, including some that have already been taken, and outlines future plans.

pdf abs
The Bulgarian Event Corpus: Overview and Initial NER Experiments
Petya Osenova | Kiril Simov | Iva Marinova | Melania Berbatova
Proceedings of the Thirteenth Language Resources and Evaluation Conference

The paper describes the Bulgarian Event Corpus (BEC). The annotation scheme is based on CIDOC-CRM ontology and on the English Framenet, adjusted for our task. It includes two main layers: named entities and events with their roles. The corpus is multi-domain and mainly oriented towards Social Sciences and Humanities (SSH). It will be used for: extracting knowledge and making it available through the Bulgaria-centric Knowledge Graph; further developing an annotation scheme that handles multiple domains in SSH; training automatic modules for the most important knowledge-based tasks, such as domain-specific and nested NER, NEL, event detection and profiling. Initial experiments were conducted on standard NER task due to complexity of the dataset and the rich NE annotation scheme. The results are promising with respect to some labels and give insights on handling better other ones. These experiments serve also as error detection modules that would help us in scheme re-design. They are a basis for further and more complex tasks, such as nested NER, NEL and event detection.

pdf abs
Raising and Control Constructions in a Bulgarian UD Parsebank of Parliament Sessions
Petya Osenova
Proceedings of the 5th International Conference on Computational Linguistics in Bulgaria (CLIB 2022)

The paper discusses the raising and control syntactic structures (marked as ‘xcomp’) in a UD parsed corpus of Bulgarian Parliamentary Sessions. The idea is: to investigate the linguistic status of this phenomenon in an automatically parsed corpus, with a focus on verbal constructions of a head and its dependant together with the shared subject; to detect the errors and get insights on how to improve the annotation scheme and the automatic detection of this phenomenon realizations in Bulgarian.

2021

pdf abs
Monitoring Fact Preservation, Grammatical Consistency and Ethical Behavior of Abstractive Summarization Neural Models
Iva Marinova | Yolina Petrova | Milena Slavcheva | Petya Osenova | Ivaylo Radev | Kiril Simov
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

The paper describes a system for automatic summarization in English language of online news data that come from different non-English languages. The system is designed to be used in production environment for media monitoring. Automatic summarization can be very helpful in this domain when applied as a helper tool for journalists so that they can review just the important information from the news channels. However, like every software solution, the automatic summarization needs performance monitoring and assured safe environment for the clients. In media monitoring environment the most problematic features to be addressed are: the copyright issues, the factual consistency, the style of the text and the ethical norms in journalism. Thus, the main contribution of our present work is that the above mentioned characteristics are successfully monitored in neural automatic summarization models and improved with the help of validation, fact-preserving and fact-checking procedures.

This paper describes Slav-NER: the 3rd Multilingual Named Entity Challenge in Slavic languages. The tasks involve recognizing mentions of named entities in Web documents, normalization of the names, and cross-lingual linking. The Challenge covers six languages and five entity types, and is organized as part of the 8th Balto-Slavic Natural Language Processing Workshop, co-located with the EACL 2021 Conference. Ten teams participated in the competition. Performance for the named entity recognition task reached 90% F-measure, much higher than reported in the first edition of the Challenge. Seven teams covered all six languages, and five teams participated in the cross-lingual entity linking task. Detailed valuation information is available on the shared task web page.

2020

pdf
Implementing an End-to-End Treebank-Informed Pipeline for Bulgarian
Alexander Popov | Petya Osenova | Kiril Simov
Proceedings of the 19th International Workshop on Treebanks and Linguistic Theories

pdf abs
On the Valency Frames of type Subject-Predicate in Bulgarian
Petya Osenova
Proceedings of the 4th International Conference on Computational Linguistics in Bulgaria (CLIB 2020)

The paper presents some observations on the semantic constraints of the intransitive subjects with respect to the predicates they combine with. For these observations a valency dictionary of Bulgarian was used. Here two clarifications are to be made. First, the intransitive predicates are viewed in a broader perspective. They combine true intransitives as well as intransitive usages of transitive verbs. The complexity comes from the modeling of these verbs in the morphological dictionary. Second, the semantic constraints that are considered here, are limited to a set of semantic roles and build on the lexicographic classes of verbs in WordNet.

Aligning senses across resources and languages is a challenging task with beneficial applications in the field of natural language processing and electronic lexicography. In this paper, we describe our efforts in manually aligning monolingual dictionaries. The alignment is carried out at sense-level for various resources in 15 languages. Moreover, senses are annotated with possible semantic relationships such as broadness, narrowness, relatedness, and equivalence. In comparison to previous datasets for this task, this dataset covers a wide range of languages and resources and focuses on the more challenging task of linking general-purpose language. We believe that our data will pave the way for further advances in alignment and evaluation of word senses by creating new solutions, particularly those notoriously requiring data such as neural networks. Our resources are publicly available at https://github.com/elexis-eu/MWSA.

pdf abs
Reconstructing NER Corpora: a Case Study on Bulgarian
Iva Marinova | Laska Laskova | Petya Osenova | Kiril Simov | Alexander Popov
Proceedings of the Twelfth Language Resources and Evaluation Conference

The paper reports on the usage of deep learning methods for improving a Named Entity Recognition (NER) training corpus and for predicting and annotating new types in a test corpus. We show how the annotations in a type-based corpus of named entities (NE) were populated as occurrences within it, thus ensuring density of the training information. A deep learning model was adopted for discovering inconsistencies in the initial annotation and for learning new NE types. The evaluation results get improved after data curation, randomization and deduplication.

2019

pdf abs
Know Your Graph. State-of-the-Art Knowledge-Based WSD
Alexander Popov | Kiril Simov | Petya Osenova
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

This paper introduces several improvements over the current state of the art in knowledge-based word sense disambiguation. Those innovations are the result of modifying and enriching a knowledge base created originally on the basis of WordNet. They reflect several separate but connected strategies: manipulating the shape and the content of the knowledge base, assigning weights over the relations in the knowledge base, and the addition of new relations to it. The main contribution of the paper is to demonstrate that the previously proposed knowledge bases organize linguistic and world knowledge suboptimally for the task of word sense disambiguation. In doing so, the paper also establishes a new state of the art for knowledge-based approaches. Its best models are competitive in the broader context of supervised systems as well.

pdf abs
A Morpho-Syntactically Informed LSTM-CRF Model for Named Entity Recognition
Lilia Simeonova | Kiril Simov | Petya Osenova | Preslav Nakov
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

We propose a morphologically informed model for named entity recognition, which is based on LSTM-CRF architecture and combines word embeddings, Bi-LSTM character embeddings, part-of-speech (POS) tags, and morphological information. While previous work has focused on learning from raw word input, using word and character embeddings only, we show that for morphologically rich languages, such as Bulgarian, access to POS information contributes more to the performance gains than the detailed morphological information. Thus, we show that named entity recognition needs only coarse-grained POS tags, but at the same time it can benefit from simultaneously using some POS information of different granularity. Our evaluation results over a standard dataset show sizeable improvements over the state-of-the-art for Bulgarian NER.

pdf abs
Modeling MWEs in BTB-WN
Laska Laskova | Petya Osenova | Kiril Simov | Ivajlo Radev | Zara Kancheva
Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019)

The paper presents the characteristics of the predominant types of MultiWord expressions (MWEs) in the BulTreeBank WordNet – BTB-WN. Their distribution in BTB-WN is discussed with respect to the overall hierarchical organization of the lexical resource. Also, a catena-based modeling is proposed for handling the issues of lexical semantics of MWEs.

pdf
Towards transferring Bulgarian Sentences with Elliptical Elements to Universal Dependencies: issues and strategies
Petya Osenova | Kiril Simov
Proceedings of the Third Workshop on Universal Dependencies (UDW, SyntaxFest 2019)

pdf bib
Proceedings of the Workshop on Language Technology for Digital Historical Archives
Cristina Vertan | Petya Osenova | Dimitar Iliev
Proceedings of the Workshop on Language Technology for Digital Historical Archives

pdf abs
Aligning the Bulgarian BTB WordNet with the Bulgarian Wikipedia
Kiril Simov | Petya Osenova | Laska Laskova | Ivajlo Radev | Zara Kancheva
Proceedings of the 10th Global Wordnet Conference

The paper reports on an ongoing work that manually maps the Bulgarian WordNet BTB-WN with Bulgarian Wikipedia. The preparatory work of extracting the Wikipedia articles and provisionally relating them to the WordNet lemmas was done automatically. The manual work includes checking of the corresponding senses in both resources as well as the missing ones. The main cases of mapping are considered. The first experiments of mapping about 1000 synsets show the establishment of more than 78 % of exact correspondences and nearly 15 % of new synsets.

2018

pdf abs
Grammatical Role Embeddings for Enhancements of Relation Density in the Princeton WordNet
Kiril Simov | Alexander Popov | Iliana Simova | Petya Osenova
Proceedings of the 9th Global Wordnet Conference

In this paper we present an approach for training verb subatom embeddings. For each verb we learn several embeddings rather than only one. These embeddings include the verb itself as well as embeddings for each grammatical role of this verb. To give an example, for the verb ‘to give’ we learn four embeddings: one for the lemma ‘give’, one for the subject, one for the direct object and one for the indirect object. We have exploited these grammatical role embeddings in order to add new syntagmatic relations to WordNet. The evaluation of the new relations quality has been done extrinsically through the Knowledge-based Word Sense Disambiguation task.

2017

pdf abs
Bulgarian-English and English-Bulgarian Machine Translation: System Design and Evaluation
Petya Osenova | Kiril Simov
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

The paper presents a deep factored machine translation (MT) system between English and Bulgarian languages in both directions. The MT system is hybrid. It consists of three main steps: (1) the source-language text is linguistically annotated, (2) it is translated to the target language with the Moses system, and (3) translation is post-processed with the help of the transferred linguistic annotation from the source text. Besides automatic evaluation we performed manual evaluation over a domain test suite of sentences demonstrating certain phenomena like imperatives, questions, etc.

pdf abs
Towards Lexical Chains for Knowledge-Graph-based Word Embeddings
Kiril Simov | Svetla Boytcheva | Petya Osenova
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

Word vectors with varying dimensionalities and produced by different algorithms have been extensively used in NLP. The corpora that the algorithms are trained on can contain either natural language text (e.g. Wikipedia or newswire articles) or artificially-generated pseudo corpora due to natural data sparseness. We exploit Lexical Chain based templates over Knowledge Graph for generating pseudo-corpora with controlled linguistic value. These corpora are then used for learning word embeddings. A number of experiments have been conducted over the following test sets: WordSim353 Similarity, WordSim353 Relatedness and SimLex-999. The results show that, on the one hand, the incorporation of many-relation lexical chains improves results, but on the other hand, unrestricted-length chains remain difficult to handle with respect to their huge quantity.

pdf
Recent Developments within BulTreeBank
Petya Osenova | Kiril Simov
Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories

bib
Proceedings of the Workshop Knowledge Resources for the Socio-Economic Sciences and Humanities associated with RANLP 2017
Kalliopi Zervanou | Petya Osenova | Eveline Wandl-Vogt | Dan Cristea
Proceedings of the Workshop Knowledge Resources for the Socio-Economic Sciences and Humanities associated with RANLP 2017

bib
Proceedings of the First Workshop on Language technology for Digital Humanities in Central and (South-)Eastern Europe
Anca Dinu | Petya Osenova | Cristina Vertan
Proceedings of the First Workshop on Language technology for Digital Humanities in Central and (South-)Eastern Europe

2016

pdf abs
The Role of the WordNet Relations in the Knowledge-based Word Sense Disambiguation Task
Kiril Simov | Alexander Popov | Petya Osenova
Proceedings of the 8th Global WordNet Conference (GWC)

In this paper we present an analysis of different semantic relations extracted from WordNet, Extended WordNet and SemCor, with respect to their role in the task of knowledge-based word sense disambiguation. The experiments use the same algorithm and the same test sets, but different variants of the knowledge graph. The results show that different sets of relations have different impact on the results: positive or negative. The beneficial ones are discussed with respect to the combination of relations and with respect to the test set. The inclusion of inference has only a modest impact on accuracy, while the addition of syntactic relations produces stable improvement over the baselines.

pdf
Towards Semantic-based Hybrid Machine Translation between Bulgarian and English
Kiril Simov | Petya Osenova | Alexander Popov
Proceedings of the 2nd Workshop on Semantics-Driven Machine Translation (SedMT 2016)

pdf
A Hybrid Approach for Deep Machine Translation
Kiril Simov | Petya Osenova
Proceedings of the 2nd Deep Machine Translation Workshop

pdf abs
MWEs in Treebanks: From Survey to Guidelines
Victoria Rosén | Koenraad De Smedt | Gyri Smørdal Losnegaard | Eduard Bejček | Agata Savary | Petya Osenova
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

By means of an online survey, we have investigated ways in which various types of multiword expressions are annotated in existing treebanks. The results indicate that there is considerable variation in treatments across treebanks and thereby also, to some extent, across languages and across theoretical frameworks. The comparison is focused on the annotation of light verb constructions and verbal idioms. The survey shows that the light verb constructions either get special annotations as such, or are treated as ordinary verbs, while VP idioms are handled through different strategies. Based on insights from our investigation, we propose some general guidelines for annotating multiword expressions in treebanks. The recommendations address the following application-based needs: distinguishing MWEs from similar but compositional constructions; searching distinct types of MWEs in treebanks; awareness of literal and nonliteral meanings; and normalization of the MWE representation. The cross-lingually and cross-theoretically focused survey is intended as an aid to accessing treebanks and an aid for further work on treebank annotation.

The Open Linguistics Working Group (OWLG) brings together researchers from various fields of linguistics, natural language processing, and information technology to present and discuss principles, case studies, and best practices for representing, publishing and linking linguistic data collections. A major outcome of our work is the Linguistic Linked Open Data (LLOD) cloud, an LOD (sub-)cloud of linguistic resources, which covers various linguistic databases, lexicons, corpora, terminologies, and metadata repositories. We present and summarize five years of progress on the development of the cloud and of advancements in open data in linguistics, and we describe recent community activities. The paper aims to serve as a guideline to orient and involve researchers with the community and/or Linguistic Linked Open Data.

This work presents parallel corpora automatically annotated with several NLP tools, including lemma and part-of-speech tagging, named-entity recognition and classification, named-entity disambiguation, word-sense disambiguation, and coreference. The corpora comprise both the well-known Europarl corpus and a domain-specific question-answer troubleshooting corpus on the IT domain. English is common in all parallel corpora, with translations in five languages, namely, Basque, Bulgarian, Czech, Portuguese and Spanish. We describe the annotated corpora and the tools used for annotation, as well as annotation statistics for each language. These new resources are freely available and will help research on semantic processing for machine translation and cross-lingual transfer.

2015

pdf
Catena Operations for Unified Dependency Analysis
Kiril Simov | Petya Osenova
Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015)

pdf bib
Proceedings of the 4th Workshop on Linked Data in Linguistics: Resources and Applications
Christian Chiarcos | John Philip McCrae | Petya Osenova | Philipp Cimiano | Nancy Ide
Proceedings of the 4th Workshop on Linked Data in Linguistics: Resources and Applications

pdf
Universalizing BulTreeBank: a Linguistic Tale about Glocalization
Petya Osenova | Kiril Simov
The 5th Workshop on Balto-Slavic Natural Language Processing

pdf bib
Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects
Preslav Nakov | Marcos Zampieri | Petya Osenova | Liling Tan | Cristina Vertan | Nikola Ljubešić | Jörg Tiedemann
Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects

pdf bib
Proceedings of the Second Workshop on Natural Language Processing and Linked Open Data
Piek Vossen | German Rigau | Petya Osenova | Kiril Simov
Proceedings of the Second Workshop on Natural Language Processing and Linked Open Data

pdf
Factored models for Deep Machine Translation
Kiril Simov | Iliana Simova | Velislava Todorova | Petya Osenova
Proceedings of the 1st Deep Machine Translation Workshop

pdf
Improving Word Sense Disambiguation with Linguistic Knowledge from a Sense Annotated Treebank
Kiril Simov | Alexander Popov | Petya Osenova
Proceedings of the International Conference Recent Advances in Natural Language Processing

2014

pdf bib
Proceedings of the EMNLP’2014 Workshop on Language Technology for Closely Related Languages and Language Variants
Preslav Nakov | Petya Osenova | Cristina Vertan
Proceedings of the EMNLP’2014 Workshop on Language Technology for Closely Related Languages and Language Variants

pdf bib
Proceedings of the Workshop on Automatic Text Simplification - Methods and Applications in the Multilingual Society (ATS-MA 2014)
Constantin Orasan | Petya Osenova | Cristina Vertan
Proceedings of the Workshop on Automatic Text Simplification - Methods and Applications in the Multilingual Society (ATS-MA 2014)

pdf bib
Joint Ensemble Model for POS Tagging and Dependency Parsing
Iliana Simova | Dimitar Vasilev | Alexander Popov | Kiril Simov | Petya Osenova
Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages

pdf abs
A System for Experiments with Dependency Parsers
Kiril Simov | Iliana Simova | Ginka Ivanova | Maria Mateva | Petya Osenova
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper we present a system for experimenting with combinations of dependency parsers. The system supports initial training of different parsing models, creation of parsebank(s) with these models, and different strategies for the construction of ensemble models aimed at improving the output of the individual models by voting. The system employs two algorithms for construction of dependency trees from several parses of the same sentence and several ways for ranking of the arcs in the resulting trees. We have performed experiments with state-of-the-art dependency parsers including MaltParser, MSTParser, TurboParser, and MATEParser, on the data from the Bulgarian treebank – BulTreeBank. Our best result from these experiments is slightly better then the best result reported in the literature for this language.

pdf abs
Constituency Parsing of Bulgarian: Word- vs Class-based Parsing
Masood Ghayoomi | Kiril Simov | Petya Osenova
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper, we report the obtained results of two constituency parsers trained with BulTreeBank, an HPSG-based treebank for Bulgarian. To reduce the data sparsity problem, we propose using the Brown word clustering to do an off-line clustering and map the words in the treebank to create a class-based treebank. The observations show that when the classes outnumber the POS tags, the results are better. Since this approach adds on another dimension of abstraction (in comparison to the lemma), its coarse-grained representation can be used further for training statistical parsers.

2013

pdf
Combining POS Tagging, Dependency Parsing and Coreferential Resolution for Bulgarian
Valentin Zhikov | Georgi Georgiev | Kiril Simov | Petya Osenova
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

pdf bib
Proceedings of the Joint Workshop on NLP&LOD and SWAIE: Semantic Web, Linked Open Data and Information Extraction
Diana Maynard | Marieke van Erp | Brian Davis | Petya Osenova | Kiril Simov | Georgi Georgiev | Preslav Nakov
Proceedings of the Joint Workshop on NLP&LOD and SWAIE: Semantic Web, Linked Open Data and Information Extraction

pdf bib
Proceedings of the Workshop on Adaptation of Language Resources and Tools for Closely Related Languages and Language Variants
Cristina Vertan | Milena Slavcheva | Petya Osenova
Proceedings of the Workshop on Adaptation of Language Resources and Tools for Closely Related Languages and Language Variants

2012

pdf abs
A Treebank-driven Creation of an OntoValence Verb lexicon for Bulgarian
Petya Osenova | Kiril Simov | Laska Laskova | Stanislava Kancheva
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The paper presents a treebank-driven approach to the construction of a Bulgarian valence lexicon with ontological restrictions over the inner participants of the event. First, the underlying ideas behind the Bulgarian Ontology-based lexicon are outlined. Then, the extraction and manipulation of the valence frames is discussed with respect to the BulTreeBank annotation scheme and DOLCE ontology. Also, the most frequent types of syntactic frames are specified as well as the most frequent types of ontological restrictions over the verb arguments. The envisaged application of such a lexicon would be: in assigning ontological labels to syntactically parsed corpora, and expanding the lexicon and lexical information in the Bulgarian Resource Grammar.

pdf abs
Linguistic Analysis Processing Line for Bulgarian
Aleksandar Savkov | Laska Laskova | Stanislava Kancheva | Petya Osenova | Kiril Simov
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper presents a linguistic processing pipeline for Bulgarian including morphological analysis, lemmatization and syntactic analysis of Bulgarian texts. The morphological analysis is performed by three modules ― two statistical-based and one rule-based. The combination of these modules achieves the best result for morphological tagging of Bulgarian over a rich tagset (680 tags). The lemmatization is based on rules, generated from a large morphological lexicon of Bulgarian. The syntactic analysis is implemented via MaltParser. The two statistical morphological taggers and MaltParser are trained on datasets constructed within BulTreeBank project. The processing pipeline includes also a sentence splitter and a tokenizer. All tools in the pipeline are packed in modules that can also perform separately. The whole pipeline is designed to be able to serve as a back-end of a web service oriented interface, but it also supports the user tasks with a command-line interface. The processing pipeline is compatible with the Text Corpus Format, which allows it to delegate the management of the components to the WebLicht platform.

pdf abs
The Political Speech Corpus of Bulgarian
Petya Osenova | Kiril Simov
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The paper introduces the Political Speech Corpus of Bulgarian. First, its current state has been discussed with respect to its size, coverage, genre specification and related online services. Then, the focus goes to the annotation details. On the one hand, the layers of linguistic annotation are presented. On the other hand, the compatibility with CLARIN technical Infrastructure is explained. Also, some user-based scenarios are mentioned to demonstrate the corpus services and applicability.

pdf
Linguistically-Augmented Bulgarian-to-English Statistical Machine Translation Model
Rui Wang | Petya Osenova | Kiril Simov
Proceedings of the Joint Workshop on Exploiting Synergies between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra)

pdf bib
Linguistically-Enriched Models for Bulgarian-to-English Machine Translation
Rui Wang | Petya Osenova | Kiril Simov
Proceedings of the Sixth Workshop on Syntax, Semantics and Structure in Statistical Translation

pdf
Feature-Rich Part-of-speech Tagging for Morphologically Complex Languages: Application to Bulgarian
Georgi Georgiev | Valentin Zhikov | Kiril Simov | Petya Osenova | Preslav Nakov
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

2011

pdf bib
Proceedings of the Workshop on Language Technologies for Digital Humanities and Cultural Heritage
Cristina Vertan | Milena Slavcheva | Petya Osenova | Stelios Piperidis
Proceedings of the Workshop on Language Technologies for Digital Humanities and Cultural Heritage

pdf
Language Technology Support for Semantic Annotation of Icono-graphic Descriptions
Kamenka Staykova | Gennady Agre | Kiril Simov | Petya Osenova
Proceedings of the Workshop on Language Technologies for Digital Humanities and Cultural Heritage

pdf bib
Proceedings of the Second Workshop on Annotation and Exploitation of Parallel Corpora
Kiril Simov | Petya Osenova | Jörg Tiedemann | Radovan Garabik
Proceedings of the Second Workshop on Annotation and Exploitation of Parallel Corpora

pdf
Bulgarian-English Parallel Treebank: Word and Semantic Level Alignment
Kiril Simov | Petya Osenova | Laska Laskova | Aleksandar Savkov | Stanislava Kancheva
Proceedings of the Second Workshop on Annotation and Exploitation of Parallel Corpora

pdf
Towards Minimal Recursion Semantics over Bulgarian Dependency Parsing
Kiril Simov | Petya Osenova
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

2010

In this paper we describe GikiCLEF, the first evaluation contest that, to our knowledge, was specifically designed to expose and investigate cultural and linguistic issues involved in structured multimedia collections and searching, and which was organized under the scope of CLEF 2009. GikiCLEF evaluated systems that answered hard questions for both human and machine, in ten different Wikipedia collections, namely Bulgarian, Dutch, English, German, Italian, Norwegian (Bokmäl and Nynorsk), Portuguese, Romanian, and Spanish. After a short historical introduction, we present the task, together with its motivation, and discuss how the topics were chosen. Then we provide another description from the point of view of the participants. Before disclosing their results, we introduce the SIGA management system explaining the several tasks which were carried out behind the scenes. We quantify in turn the GIRA resource, offered to the community for training and further evaluating systems with the help of the 50 topics gathered and the solutions identified. We end the paper with a critical discussion of what was learned, advancing possible ways to reuse the data.

pdf abs
Exploring Co-Reference Chains for Concept Annotation of Domain Texts
Petya Osenova | Laska Laskova | Kiril Simov
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The paper explores the co-reference chains as a way for improving the density of concept annotation over domain texts. The idea extends authors previous work on relating the ontology to the text terms in two domains ― IT and textile. Here IT domain is used. The challenge is to enhance relations among concepts instead of text entities, the latter pursued in most works. Our ultimate goal is to exploit these additional chains for concept disambiguation as well as sparseness resolution at concept level. First, a gold standard was prepared with manually connected links among concepts, anaphoric pronouns and contextual equivalents. This step was necessary not only for test purposes, but also for better orientation in the co-referent types and distribution. Then, two automatic systems were tested on the gold standard. Note that these systems were not designed specially for concept chaining. The conclusion is that the state-of-the-art co-reference resolution systems might address the concept sparseness problem, but not so much the concept disambiguation task. For the latter, word-sense disambiguation systems have to be integrated.

pdf abs
Constructing of an Ontology-based Lexicon for Bulgarian
Kiril Simov | Petya Osenova
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this paper we report on the progress in the creation of an Ontology-based lexicon for Bulgarian. We have started with the concept set from an upper ontology (DOLCE). Then it was extended with concepts selected from the OntoWordNet, which correspond to Core WordNet and EuroWordNet Basic concepts. The underlying idea behind the ontology-based lexicon is its organization via two semantic relations - equivalence and subsumption. These relations reflect the distribution of lexical unit senses with respect to the concepts in the ontology. The lexical unit candidates for concept mapping have been selected from two large and well-developed lexical resources for Bulgarian - a machine readable explanatory dictionary and a morphological lexicon. In the initial step, the lexical units were handled that have equivalent senses to the concepts in the ontology (2500 at the moment). Then, in the second stage, we are proceeding with lexical units selected on their frequency distribution in a large Bulgarian corpus. This step is the more challenging one, since it might require also additions of concepts to the ontology. The main applications of the lexicon are envisaged to be the semantic annotation and semantic IR for Bulgarian.

2009

pdf
Feature-Rich Named Entity Recognition for Bulgarian Using Conditional Random Fields
Georgi Georgiev | Preslav Nakov | Kuzman Ganchev | Petya Osenova | Kiril Simov
Proceedings of the International Conference RANLP-2009

pdf
A Web-Enabled and Speech-Enhanced Parallel Corpus of Greek-Bulgarian Cultural Texts
Voula Giouli | Nikos Glaros | Kiril Simov | Petya Osenova
Proceedings of the EACL 2009 Workshop on Language Technology and Resources for Cultural Heritage, Social Sciences, Humanities, and Education (LaTeCH – SHELT&R 2009)

pdf bib
Proceedings of the Workshop on Adaptation of Language Resources and Technology to New Domains
Núria Bel | Erhard Hinrichs | Petya Osenova | Kiril Simov
Proceedings of the Workshop on Adaptation of Language Resources and Technology to New Domains

pdf
Cross-lingual Adaptation as a Baseline: Adapting Maximum Entropy Models to Bulgarian
Georgi Georgiev | Preslav Nakov | Petya Osenova | Kiril Simov
Proceedings of the Workshop on Adaptation of Language Resources and Technology to New Domains

2008

pdf abs
Language Resources for Semantic Document Annotation and Crosslingual Retrieval
Petya Osenova | Kiril Simov | Eelco Mossel
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper describes the interaction among language resources for an adequate concept annotation of domain texts in several languages. The architecture includes domain ontology, domain texts, language specific lexicons, regular grammars and disambiguation rules. Ontology plays a central role in the architecture. We assume that it represents the meaning of the terms in the lexicons. Thus, the lexicons for the languages of the project (http://www.lt4el.eu/ - the LT4eL (Language Technology for eLearning) project is supported by the European Community under the Information Society and Media Directorate, Learning and Cultural Heritage Unit.) are constructed on the base of the ontology. The grammars and disambiguation rules facilitate the annotation of the text with concepts from the ontology. The established in this way relation between ontology and text supports different searches for content in the annotated documents. This is considered the preparatory phase for the integration of a semantic search facility in Learning Management Systems. The implementation and performance of this search are discussed in the context of related work as well as other types of searches. Also the results from some preliminary steps towards evaluation of the concept-based and text-based search are presented.

2007

2006

pdf abs
Shallow Semantic Annotation of Bulgarian
Kiril Simov | Petya Osenova
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

The paper discusses shallow semantic annotation of Bulgarian treebank. Our goal is to construct the next layer of linguistic interpretation over the morphological and syntactic layers that have already been encoded in the treebank. The annotation is called shallow because it encodes only the senses for the non-functional words and the relations between the semantic indices connected to them. We do not encode quantifiers and scope information. An ontology is employed as a stock of the concepts and relations that form the word senses. Our lexicon is based on the Generative Lexicon (GL) model (Pustejovsky 1995) as it was implemented in the SIMPLE project (Lenci et. al. 2000). GL defines the way in which the words are connected to the concepts and the relations in the ontology. Also it provides mechanisms for literal sense changes like type-coercion, metonymy, and similar. Some of these phenomena are presented in the annotation.

This paper presents an overview of the Multilingual Question Answering evaluation campaigns which have been organized at CLEF (Cross Language Evaluation Forum) since 2003. Over the years, the competition has registered a steady increment in the number of participants and languages involved. In fact, from the original eight groups which participated in 2003 QA track, the number of competitors in 2005 rose to twenty-four. Also, the performances of the systems have steadily improved, and the average of the best performances in the 2005 saw an increase of 10% with respect to the previous year.

2004

pdf abs
A Language Resources Infrastructure for Bulgarian
Kiril Simov | Petya Osenova | Sia Kolkovska | Elisaveta Balabanova | Dimitar Doikoff
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

This paper describes the infrastructure of a basic language resources set for Bulgarian in the context of BLARK initiative requirements. We focus on the treebanking task as a trigger for basic language resources compilation. Two strategies have been applied in this respect: (1) implementing the main pre-processing modules before the treebank compilation and (2) creating more elaborate types of resources in parallel to the treebank compilation. The description of language resources within BulTreeBank project is divided into two parts: language technology, which includes tokenization, morphosyntactic analyzer, morphosyntactic disambiguation, partial grammars, and language data, which includes the layers of the BulTreeBank corpus and the variety of lexicons. The advantages of our approach to a less-spoken language (like Bulgarian) are as follows: it triggers the creation of the basic set of language resources which lack for certain languages and it rises the question about the ways of language resources creation.

pdf
Making Monolingual Corpora Comparable: a Case Study of Bulgarian and Croatian
Božo Bekavac | Petya Osenova | Kiril Simov | Marko Tadić
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

pdf
A Hybrid Strategy For Regular Grammar Parsing
Kiril Simov | Petya Osenova
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)