2023
TermoUD - a language-independent terminology extraction tool
Malgorzata Marciniak | Piotr Rychlik | Agnieszka Mykowiecka
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations
The paper presents TermoUD, a language-independent terminology extraction tool. Its previous version, TermoPL (Marciniak et al., 2016; Rychlik et al., 2022), uses a language-dependent shallow grammar to select candidate terms. The goal behind the development of TermoUD is to make the procedure as universal as possible while preserving the linguistic correctness of the selected phrases. The tool is suitable for any language for which a Universal Dependencies (UD) parser exists. We describe a method of candidate term extraction based on UD POS tags and UD relations. Candidates are ranked with the C-value metric (with context counting adapted to the UD formalism), which does not need any additional language resources. The performance of the tool has been tested on texts in English, French, Dutch, and Slovenian. The results are evaluated on the manually annotated datasets ACTER, RD-TEC 2.0, GENIA and RSDO5, and compared to those obtained by other tools.
2020
Supporting terminology extraction with dependency parses
Malgorzata Marciniak | Piotr Rychlik | Agnieszka Mykowiecka
Proceedings of the 6th International Workshop on Computational Terminology
The terminology extraction procedure usually consists of selecting candidates for terms and ordering them according to their importance for the given text or set of texts. Depending on the method used, the list of candidates contains different fractions of grammatically incorrect, semantically odd and irrelevant sequences. The aim of this work was to improve term candidate selection by using a dependency parser for Polish to reduce the number of incorrect sequences.
Are White Ravens Ever White? - Non-Literal Adjective-Noun Phrases in Polish
Agnieszka Mykowiecka | Malgorzata Marciniak
Proceedings of the Twelfth Language Resources and Evaluation Conference
In the paper we describe two resources of Polish data focused on literal and metaphorical meanings of adjective-noun phrases. The first one, FigAN, consists of isolated phrases divided into three types: phrases with only a literal meaning, phrases with only a metaphorical meaning, and phrases which can be interpreted as literal or metaphorical depending on the context of use. The second resource is the FigSen corpus, which consists of 1833 short text fragments, each containing at least one phrase from FigAN that may have both meanings. The corpus is annotated in two ways. The first approach annotates all adjective-noun phrases. In the second approach, literal or metaphorical senses are assigned to all adjectives and nouns in the data. The paper presents statistics of the data and compares the two types of annotation. The corpora were used in experiments on automatic recognition of Polish non-literal adjective-noun phrases.
2019
Experiments with ad hoc ambiguous abbreviation expansion
Agnieszka Mykowiecka | Malgorzata Marciniak
Proceedings of the Tenth International Workshop on Health Text Mining and Information Analysis (LOUHI 2019)
The paper describes experiments on expanding ad hoc ambiguous abbreviations in medical notes on the basis of morphologically annotated texts, without using additional domain resources. We work on Polish data, but the described approaches can be used for other languages too. We test two methods of selecting candidates for word abbreviation expansions. The first one automatically selects all words in the text which might be an expansion of an abbreviation according to the language rules. The second method uses clustering of abbreviation occurrences to select representative elements, which are manually annotated to determine lists of potential expansions. We then train a classifier to assign expansions to abbreviations based on three training sets: the automatically obtained one, the manually annotated one, and the concatenation of the two. The results obtained with the manually annotated training data significantly outperform those obtained with the automatically obtained training data. Adding the automatically obtained training data to the manually annotated data improves the results, in particular for less frequent abbreviations. In this context the proposed a priori data-driven selection of possible expansions turned out to be crucial.
2018
Literal, Metphorical or Both? Detecting Metaphoricity in Isolated Adjective-Noun Phrases
Agnieszka Mykowiecka | Malgorzata Marciniak | Aleksander Wawer
Proceedings of the Workshop on Figurative Language Processing
The paper addresses the classification of isolated Polish adjective-noun phrases according to their metaphoricity. We tested neural networks to predict whether a phrase has a literal or metaphorical sense or can have both senses depending on usage. The input to the neural network consists of word embeddings, but we also tested the impact of information about the domain of the adjective and about the abstractness of the noun. We also applied our solution to publicly available English data and compared the outcome with previously published results. We found that the solution based on word embeddings only can achieve results comparable with complex solutions requiring additional information.
Detecting Figurative Word Occurrences Using Recurrent Neural Networks
Agnieszka Mykowiecka | Aleksander Wawer | Malgorzata Marciniak
Proceedings of the Workshop on Figurative Language Processing
The paper addresses the detection of figurative usage of words in English text. The chosen method was to use neural nets fed with pretrained word embeddings. The obtained results show that simple solutions, based on word embeddings only, are comparable to complex solutions that use many sources of information which are not available for languages less studied than English.
SimLex-999 for Polish
Agnieszka Mykowiecka | Małgorzata Marciniak | Piotr Rychlik
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2016
Recognition of non-domain phrases in automatically extracted lists of terms
Agnieszka Mykowiecka | Malgorzata Marciniak | Piotr Rychlik
Proceedings of the 5th International Workshop on Computational Terminology (Computerm2016)
In the paper, we address the problem of recognizing non-domain phrases in terminology lists obtained with an automatic term extraction tool. We focus on the identification of multi-word phrases that are general terms and discourse function expressions. We tested several methods based on domain corpora comparison and a method based on the contexts of phrases identified in a large corpus of general language. We compared the results of the methods to manual annotation. The results show that the task is quite hard, as the inter-annotator agreement is low. Several tested methods achieved similar overall results, although the phrase ordering varied between methods. The most successful method, with a precision of about 0.75 at the halfway point of the tested list, was the context-based method using a modified contextual diversity coefficient.
TermoPL - a Flexible Tool for Terminology Extraction
Malgorzata Marciniak | Agnieszka Mykowiecka | Piotr Rychlik
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
The purpose of this paper is to introduce the TermoPL tool created to extract terminology from domain corpora in Polish. The program extracts noun phrases as term candidates with the help of a simple grammar that can be adapted to the user’s needs. It applies the C-value method to rank term candidates, which are either the longest identified nominal phrases or their nested subphrases. The method operates on simplified base forms in order to unify morphological variants of terms and to recognize their contexts. We support the recognition of nested terms with a word connection strength measure, which allows us to eliminate truncated phrases from the top part of the term list. The program has an option to convert simplified forms of phrases into correct phrases in the nominative case. TermoPL accepts as input morphologically annotated and disambiguated domain texts and creates a list of terms, the top part of which comprises domain terminology. It can also compare two candidate term lists using three different coefficients showing the asymmetry of term occurrences in the data.
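As a rough illustration of the ranking step mentioned in this abstract, the following is a minimal sketch of the textbook C-value score, assuming the standard formulation for nested multi-word terms. It is not TermoPL's implementation: the function name `c_value`, the tuple-of-tokens representation, and the toy frequencies are all hypothetical, and the tool's own handling of contexts, base forms, and single-word terms may differ. Under this plain formulation single-word candidates score 0 because log2(1) = 0.

```python
import math

def c_value(freq):
    """Score candidates given total corpus frequencies.

    freq maps a candidate (tuple of tokens) to its total frequency,
    counting occurrences nested inside longer candidates as well.
    """
    # For every candidate, collect the longer candidates that contain it.
    containers = {term: set() for term in freq}
    for longer in freq:
        n = len(longer)
        for size in range(1, n):
            for start in range(n - size + 1):
                sub = longer[start:start + size]
                if sub in freq:
                    containers[sub].add(longer)

    scores = {}
    for term, f in freq.items():
        nested_in = containers[term]
        if nested_in:
            # Subtract the average frequency of the containing candidates.
            f = f - sum(freq[b] for b in nested_in) / len(nested_in)
        scores[term] = math.log2(len(term)) * f
    return scores

# Toy, made-up frequencies for illustration only.
toy_freq = {
    ("basal", "cell", "carcinoma"): 5,
    ("cell", "carcinoma"): 7,   # 5 of these occur inside the longer phrase
    ("basal", "cell"): 5,       # occurs only inside the longer phrase
}
toy_scores = c_value(toy_freq)
ranking = sorted(toy_scores, key=toy_scores.get, reverse=True)
```

In this toy data the truncated phrase ("basal", "cell") ends up with a score of 0 because it never occurs outside the longer candidate, which is the effect that makes the measure useful for nested terms.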
2014
NPMI Driven Recognition of Nested Terms
Malgorzata Marciniak | Agnieszka Mykowiecka
Proceedings of the 4th International Workshop on Computational Terminology (Computerm)
2012
Combining Wordnet and Morphosyntactic Information in Terminology Clustering
Agnieszka Mykowiecka | Malgorzata Marciniak
Proceedings of COLING 2012
2011
Towards Morphologically Annotated Corpus of Hospital Discharge Reports in Polish
Malgorzata Marciniak | Agnieszka Mykowiecka
Proceedings of BioNLP 2011 Workshop
2007
Information Extraction from Patients’ Free Form Documentation
Agnieszka Mykowiecka | Malgorzata Marciniak
Biological, translational, and clinical language processing
Automatic Processing of Diabetic Patients’ Hospital Documentation
Małgorzata Marciniak | Agnieszka Mykowiecka
Proceedings of the Workshop on Balto-Slavonic Natural Language Processing
2000
An HPSG-Annotated Test Suite for Polish
Malgorzata Marciniak | Agnieszka Mykowiecka | Anna Kupść | Adam Przepiórkowski
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)