Svetla Boytcheva

2024

SU-FMI at SemEval-2024 Task 5: From BERT Fine-Tuning to LLM Prompt Engineering - Approaches in Legal Argument Reasoning
Kristiyan Krumov | Svetla Boytcheva | Ivan Koytchev
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

This paper presents our approach and findings for SemEval-2024 Task 5, focusing on legal argument reasoning. We explored the effectiveness of fine-tuning pre-trained BERT models and the innovative application of large language models (LLMs) through prompt engineering in the context of legal texts. Our methodology involved a combination of techniques to address the challenges posed by legal language processing, including handling long texts and optimizing natural language understanding (NLU) capabilities for the legal domain. Our contributions were validated by achieving a third-place ranking on the SemEval 2024 Task 5 Leaderboard. The results underscore the potential of LLMs and prompt engineering in enhancing legal reasoning tasks, offering insights into the evolving landscape of NLU technologies within the legal field.

2023

NEXT: An Event Schema Extension Approach for Closed-Domain Event Extraction Models
Elena Tuparova | Petar Ivanov | Andrey Tagarev | Svetla Boytcheva | Ivan Koychev
Proceedings of the 6th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text

Event extraction from textual data is an NLP research task relevant to a plethora of domains. Most approaches aim to recognize events from a predefined event schema, consisting of event types and their corresponding arguments. For domains such as disinformation, where new event types emerge frequently, there is a need to adapt such fixed event schemas to accommodate new event types. We present NEXT (New Event eXTraction) - a resource-sparse approach to extending a closed-domain model to novel event types that requires a very small number of annotated samples for fine-tuning performed on a single GPU. Furthermore, our results suggest that this approach is suitable not only for extraction of new event types, but also for recognition of existing event types, as the use of this approach on a new dataset leads to improved recall for all existing events while retaining precision.

FMI-SU at SemEval-2023 Task 7: Two-level Entailment Classification of Clinical Trials Enhanced by Contextual Data Augmentation
Sylvia Vassileva | Georgi Grazhdanski | Svetla Boytcheva | Ivan Koychev
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

The paper presents an approach for solving SemEval 2023 Task 7 - identifying the inference relation in a clinical trials dataset. The system has two levels: it first retrieves relevant clinical trial evidence for a statement and then classifies the inference relation based on the relevant sentences. In the first level, the system classifies the evidence-statement pairs as relevant or not using a BERT-based classifier and contextual data augmentation (subtask 2). Using the relevant parts of the clinical trial from the first level, the system uses an additional BERT-based classifier to determine whether the relation is entailment or contradiction (subtask 1). In both levels, contextual data augmentation yields a significant improvement in the F1 score on the test set, 3.7% for subtask 2 and 7.6% for subtask 1, achieving final F1 scores of 82.7% for subtask 2 and 64.4% for subtask 1.

Clinical Text Classification to SNOMED CT Codes Using Transformers Trained on Linked Open Medical Ontologies
Anton Hristov | Petar Ivanov | Anna Aksenova | Tsvetan Asamov | Pavlin Gyurov | Todor Primov | Svetla Boytcheva
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

We present an approach for medical text coding with SNOMED CT. Our approach uses publicly available linked open data from terminologies and ontologies as training data for the algorithms. We claim that even small training corpora made of short text snippets can be used to train models for the given task. We propose a method based on transformers enhanced with clustering and filtering of the candidates. Further, we adopt a classical machine learning approach - support vector classification (SVC) using transformer embeddings. The resulting approach proves to be more accurate than the predictions given by Large Language Models. We evaluate on a dataset generated from linked open data for SNOMED codes related to morphology and topography for four use cases. Our transformers-based approach achieves an F1-score of 0.82 for morphology and 0.99 for topography codes. Further, we validate the applicability of our approach in a clinical context using labelled real clinical data that are not used for model training.

2021

Application of Deep Learning Methods to SNOMED CT Encoding of Clinical Texts: From Data Collection to Extreme Multi-Label Text-Based Classification
Anton Hristov | Aleksandar Tahchiev | Hristo Papazov | Nikola Tulechki | Todor Primov | Svetla Boytcheva
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Concept normalization of clinical texts to standard medical classifications and ontologies is a task with high importance for healthcare and medical research. We attempt to solve this problem through automatic SNOMED CT encoding, where SNOMED CT is one of the most widely used and comprehensive clinical term ontologies. Applying basic Deep Learning models, however, leads to undesirable results due to the unbalanced nature of the data and the extreme number of classes. We propose a classification procedure that features a multiple-step workflow consisting of label clustering, multi-cluster classification, and clusters-to-labels mapping. For multi-cluster classification, BioBERT is fine-tuned over our custom dataset. The clusters-to-labels mapping is carried out by a one-vs-all classifier (SVC) applied to every single cluster. We also present the steps for automatic dataset generation of textual descriptions annotated with SNOMED CT codes based on public data and linked open data. In order to cope with the problem that our dataset is highly unbalanced, some data augmentation methods are applied. The results from the conducted experiments show high accuracy and reliability of our approach for prediction of SNOMED CT codes relevant to a clinical text.

Comparative Analysis of Fine-tuned Deep Learning Language Models for ICD-10 Classification Task for Bulgarian Language
Boris Velichkov | Sylvia Vassileva | Simeon Gerginov | Boris Kraychev | Ivaylo Ivanov | Philip Ivanov | Ivan Koychev | Svetla Boytcheva
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

The task of automatic diagnosis encoding into standard medical classifications and ontologies is of great importance in medicine - both to support the daily tasks of physicians in the preparation and reporting of clinical documentation, and for automatic processing of clinical reports. In this paper we investigate the application and performance of different deep learning transformers for automatic encoding in ICD-10 of clinical texts in Bulgarian. The comparative analysis attempts to find which approach is more efficient for fine-tuning a pretrained BERT-family transformer to deal with specific domain terminology in a low-resource language such as Bulgarian. On the one hand, we use SlavicBERT and MultilingualBERT, which are pretrained on common vocabulary in Bulgarian but lack medical terminology. On the other hand, we use BioBERT, ClinicalBERT, SapBERT, and BlueBERT, which are pretrained on medical terminology in English but lack training on Bulgarian and, moreover, on vocabulary in Cyrillic. In our study all BERT models are fine-tuned with additional medical texts in Bulgarian and then applied to the classification task of encoding medical diagnoses in Bulgarian into ICD-10 codes. A large corpus of diagnoses in Bulgarian annotated with ICD-10 codes is used for the classification task. Such an analysis gives a good idea of which of the models would be suitable for tasks of a similar type and domain. The experiments and evaluation results show that both approaches have comparable accuracy.

Vast amounts of data in healthcare are available in unstructured text format, usually in the local language of each country. These documents contain valuable information. Secondary use of clinical narratives, and extraction of key facts and relations about the patient's disease history from them, can foster preventive medicine and improve healthcare. In this paper, we propose a hybrid method for the automatic transformation of clinical text into a structured format. The documents are automatically sectioned into the following parts: diagnosis, patient history, patient status, lab results. For the “Diagnosis” section a deep learning text-based encoding into ICD-10 codes is applied using MBG-ClinicalBERT - a fine-tuned ClinicalBERT model for Bulgarian medical text. From the “Patient History” section, we identify patient symptoms using a rule-based approach enhanced with similarity search based on MBG-ClinicalBERT word embeddings. We also identify symptom relations like negation. For the “Patient Status” description, binary classification is used to determine the status of each anatomic organ. In this paper, we demonstrate different methods for adapting NLP tools for English and other languages to a low-resource language like Bulgarian.

2019

Risk Factors Extraction from Clinical Texts based on Linked Open Data
Svetla Boytcheva | Galia Angelova | Zhivko Angelov
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

This paper presents experiments in risk factor analysis based on clinical texts enhanced with Linked Open Data (LOD). The idea is to determine whether a patient has risk factors for a specific disease by analyzing only his/her outpatient records. A semantic graph of “meta-knowledge” about a disease of interest is constructed, with integrated multilingual terms (labels) of symptoms, risk factors etc. coming from Wikidata, PubMed, Wikipedia and MeSH, and linked to clinical records of individual patients via ICD-10 codes. Then a predictive model is trained to predict whether patients are at risk of developing the disease of interest. The testing was done using outpatient records from a nation-wide repository available for the period 2011-2016. The results show improvement of the overall performance of all tested algorithms (kNN, Naive Bayes, Tree, Logistic regression, ANN) when the clinical texts are enriched with LOD resources.

Comparison of Machine Learning Approaches for Industry Classification Based on Textual Descriptions of Companies
Andrey Tagarev | Nikola Tulechki | Svetla Boytcheva
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

This paper addresses the task of categorizing companies within industry classification schemes. The dataset consists of encyclopedic articles about companies and their economic activities. The target classification schema is built by mapping linked open data in a semi-supervised manner. Target classes are built bottom-up from DBpedia. We apply several state-of-the-art text classification techniques, based both on deep learning and classical vector-space models.

Deep learning contextual models for prediction of sport event outcome from sportsman’s interviews
Boris Velichkov | Ivan Koychev | Svetla Boytcheva
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

This paper presents an approach for predicting the results of sport events. Usually, sport forecasting approaches are based on structured data. We test the hypothesis that sports results can be predicted by using natural language processing and machine learning techniques applied to interviews with the players shortly before the sport events. The proposed method uses deep learning contextual models, applied to unstructured textual documents. Several experiments were performed on interviews with players in individual sports like boxing, martial arts, and tennis. The results of the conducted experiments confirmed our initial assumption that an interview with a sportsman before a match contains information that can be used to predict its outcome. Furthermore, the results provide strong evidence in support of our research hypothesis, that is, that we can predict the outcome of a sport match by analyzing an interview given before it.

2017

Mining Association Rules from Clinical Narratives
Svetla Boytcheva | Ivelina Nikolova | Galia Angelova
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

Shallow text analysis (Text Mining) uses mainly Information Extraction techniques. Low-resource languages do not allow application of such traditional techniques with sufficient accuracy and recall on big data. In contrast, Data Mining approaches provide an opportunity to perform deep analysis and to discover new knowledge. Frequent pattern mining approaches are used mainly for structured information in databases, and applying them in text mining is quite a challenging task. Unfortunately, most frequent pattern mining approaches do not use contextual information for extracted patterns: general patterns are extracted regardless of the context. We propose a method that processes raw informal texts (from health discussion forums) and formal texts (outpatient records) in Bulgarian. In addition, we use some context information and small terminological lexicons to generalize extracted frequent patterns. This allows informal expressions of medical terminology to be mapped to formal ones and resources to be generated automatically.

Towards Lexical Chains for Knowledge-Graph-based Word Embeddings
Kiril Simov | Svetla Boytcheva | Petya Osenova
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

Word vectors with varying dimensionalities and produced by different algorithms have been extensively used in NLP. The corpora that the algorithms are trained on can contain either natural language text (e.g. Wikipedia or newswire articles) or artificially-generated pseudo corpora due to natural data sparseness. We exploit Lexical Chain based templates over Knowledge Graph for generating pseudo-corpora with controlled linguistic value. These corpora are then used for learning word embeddings. A number of experiments have been conducted over the following test sets: WordSim353 Similarity, WordSim353 Relatedness and SimLex-999. The results show that, on the one hand, the incorporation of many-relation lexical chains improves results, but on the other hand, unrestricted-length chains remain difficult to handle with respect to their huge quantity.

Proceedings of the Biomedical NLP Workshop associated with RANLP 2017
Svetla Boytcheva | Kevin Bretonnel Cohen | Guergana Savova | Galia Angelova
Proceedings of the Biomedical NLP Workshop associated with RANLP 2017

Identification of Risk Factors in Clinical Texts through Association Rules
Svetla Boytcheva | Ivelina Nikolova | Galia Angelova | Zhivko Angelov
Proceedings of the Biomedical NLP Workshop associated with RANLP 2017

We describe a method which extracts Association Rules from texts in order to recognise verbalisations of risk factors. Usually some basic vocabulary about risk factors is known but medical conditions are expressed in clinical narratives with much higher variety. We propose an approach for data-driven learning of specialised medical vocabulary which, once collected, enables early alerting of potentially affected patients. The method is illustrated by experiments with clinical records of patients with Chronic Obstructive Pulmonary Disease (COPD) and comorbidity of CORD, Diabetes Mellitus and Schizophrenia. Our input data come from the Bulgarian Diabetic Register, which is built using a pseudonymised collection of outpatient records for about 500,000 diabetic patients. The generated Association Rules for CORD are analysed in the context of demographic, gender, and age information. Valuable amounts of meaningful words, signalling risk factors, are discovered with high precision and confidence.

Annotation of Clinical Narratives in Bulgarian language
Ivajlo Radev | Kiril Simov | Galia Angelova | Svetla Boytcheva
Proceedings of the Biomedical NLP Workshop associated with RANLP 2017

In this paper we describe the process of annotating clinical texts with morphosyntactic and semantic information. The corpus contains 1,300 discharge letters in Bulgarian for patients with Endocrinology and Metabolic disorders. The annotated corpus will be used as a gold standard for evaluating information extraction on a test corpus of 6,200 discharge letters. The annotation is performed within the Clark system, an XML Based System For Corpora Development. It provides a mechanism for semi-automatic annotation: first, a pipeline for Bulgarian morphosyntactic annotation and a cascaded regular grammar for semantic annotation are run; then rules for cleaning frequent errors are applied. Finally, the result is manually checked. We also hope to adapt the morphosyntactic tagger to the domain of clinical narratives.

2012

Automatic Analysis of Patient History Episodes in Bulgarian Hospital Discharge Letters
Svetla Boytcheva | Galia Angelova | Ivelina Nikolova
Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics

2011

Automatic Matching of ICD-10 codes to Diagnoses in Discharge Letters
Svetla Boytcheva
Proceedings of the Second Workshop on Biomedical Natural Language Processing

Towards Temporal Segmentation of Patient History in Discharge Letters
Galia Angelova | Svetla Boytcheva
Proceedings of the Second Workshop on Biomedical Natural Language Processing

2009