2023
TermoUD - a language-independent terminology extraction tool
Malgorzata Marciniak | Piotr Rychlik | Agnieszka Mykowiecka
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations
The paper presents TermoUD, a language-independent terminology extraction tool. Its previous version, TermoPL (Marciniak et al., 2016; Rychlik et al., 2022), uses a language-dependent shallow grammar to select candidate terms. The goal behind the development of TermoUD is to make the procedure as universal as possible, while taking care of the linguistic correctness of selected phrases. The tool is suitable for any language for which a Universal Dependencies (UD) parser exists. We describe a method of candidate term extraction based on UD POS tags and UD relations. Candidates are ranked by the C-value metric (with context counting adapted to the UD formalism), which doesn’t need any additional language resources. The performance of the tool has been tested on texts in English, French, Dutch, and Slovenian. The results are evaluated on the manually annotated datasets ACTER, RD-TEC 2.0, GENIA and RSDO5, and compared to those obtained by other tools.
2020
Supporting terminology extraction with dependency parses
Malgorzata Marciniak | Piotr Rychlik | Agnieszka Mykowiecka
Proceedings of the 6th International Workshop on Computational Terminology
A terminology extraction procedure usually consists of selecting candidates for terms and ordering them according to their importance for the given text or set of texts. Depending on the method used, a list of candidates contains different fractions of grammatically incorrect, semantically odd and irrelevant sequences. The aim of this work was to improve term candidate selection by reducing the number of incorrect sequences using a dependency parser for Polish.
Are White Ravens Ever White? - Non-Literal Adjective-Noun Phrases in Polish
Agnieszka Mykowiecka | Malgorzata Marciniak
Proceedings of the Twelfth Language Resources and Evaluation Conference
In the paper we describe two Polish resources focused on literal and metaphorical meanings of adjective-noun phrases. The first one, FigAN, consists of isolated phrases divided into three types: phrases with only a literal meaning, phrases with only a metaphorical meaning, and phrases which can be interpreted as literal or metaphorical depending on the context of use. The second resource is the FigSen corpus, which consists of 1833 short text fragments, each containing at least one phrase from the FigAN data which may have both meanings. The corpus is annotated in two ways. The first approach annotates all adjective-noun phrases. In the second approach, literal or metaphorical senses are assigned to all adjectives and nouns in the data. The paper presents statistics of the data and compares the two types of annotation. The corpora were used in experiments on automatic recognition of Polish non-literal adjective-noun phrases.
2019
Experiments with ad hoc ambiguous abbreviation expansion
Agnieszka Mykowiecka | Malgorzata Marciniak
Proceedings of the Tenth International Workshop on Health Text Mining and Information Analysis (LOUHI 2019)
The paper describes experiments on expanding ad hoc ambiguous abbreviations in medical notes on the basis of morphologically annotated texts, without using additional domain resources. We work on Polish data, but the described approaches can be used for other languages too. We test two methods to select candidates for word abbreviation expansions. The first one automatically selects all words in the text which might be an expansion of an abbreviation according to the language rules. The second method uses clustering of abbreviation occurrences to select representative elements, which are manually annotated to determine lists of potential expansions. We then train a classifier to assign expansions to abbreviations based on three training sets: one obtained automatically, one annotated manually, and the concatenation of the two. The results obtained with the manually annotated training data significantly outperform those obtained with the automatically obtained training data. Adding the automatically obtained training data to the manually annotated data improves the results, in particular for less frequent abbreviations. In this context the proposed a priori data-driven selection of possible extensions turned out to be crucial.
Estimating senses with sets of lexically related words for Polish word sense disambiguation
Szymon Rutkowski | Piotr Rychlik | Agnieszka Mykowiecka
Proceedings of the 10th Global Wordnet Conference
We propose a new algorithm for word sense disambiguation, exploiting data from a WordNet with many types of lexical relations, such as plWordNet for Polish. In this method, sense probabilities in context are approximated with a language model. To estimate the likelihood of a sense appearing amidst the word sequence, the token being disambiguated is substituted with words related lexically to the given sense or words appearing in its WordNet gloss. We test this approach on a set of sense-annotated Polish sentences with a number of neural language models. Our best setup achieves an accuracy score of 55.12% (72.02% when first senses are excluded), up from the 51.77% of an existing PageRank-based method. While not exceeding the first (often meaning most frequent) sense baseline in the standard case, this encourages further research on combining WordNet data with neural models.
2018
Literal, Metphorical or Both? Detecting Metaphoricity in Isolated Adjective-Noun Phrases
Agnieszka Mykowiecka | Malgorzata Marciniak | Aleksander Wawer
Proceedings of the Workshop on Figurative Language Processing
The paper addresses the classification of isolated Polish adjective-noun phrases according to their metaphoricity. We tested neural networks to predict if a phrase has a literal or metaphorical sense or can have both senses depending on usage. The input to the neural network consists of word embeddings, but we also tested the impact of information about the domain of the adjective and about the abstractness of the noun. We applied our solution to English data available on the Internet and compared it to results published in papers. We found that the solution based on word embeddings only can achieve results comparable with complex solutions requiring additional information.
Detecting Figurative Word Occurrences Using Recurrent Neural Networks
Agnieszka Mykowiecka | Aleksander Wawer | Malgorzata Marciniak
Proceedings of the Workshop on Figurative Language Processing
The paper addresses the detection of figurative usage of words in English text. The chosen method was to use neural nets fed with pretrained word embeddings. The obtained results show that simple solutions, based on word embeddings only, are comparable to complex solutions that use many sources of information which are not available for languages less studied than English.
SimLex-999 for Polish
Agnieszka Mykowiecka | Małgorzata Marciniak | Piotr Rychlik
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2017
Detecting Metaphorical Phrases in the Polish Language
Aleksander Wawer | Agnieszka Mykowiecka
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017
In this paper we describe experiments with automated detection of metaphors in the Polish language. We focus our analysis on noun phrases composed of an adjective and a noun, and distinguish three types of expressions: with literal sense, with metaphorical sense, and expressions both literal and metaphorical (context-dependent). We propose a method of automatically recognizing expression type using word embeddings and neural networks. We evaluate multiple neural network architectures and demonstrate that the method significantly outperforms strong baselines.
Supervised and Unsupervised Word Sense Disambiguation on Word Embedding Vectors of Unambigous Synonyms
Aleksander Wawer | Agnieszka Mykowiecka
Proceedings of the 1st Workshop on Sense, Concept and Entity Representations and their Applications
This paper compares two approaches to word sense disambiguation using word embeddings trained on unambiguous synonyms. The first is an unsupervised method based on computing log probability from sequences of word embedding vectors, taking into account ambiguous word senses and guessing the correct sense from context. The second method is supervised. We use a multilayer neural network model to learn a context-sensitive transformation that maps an input vector of an ambiguous word into an output vector representing its sense. We evaluate both methods on corpora with manual annotations of word senses from the Polish wordnet (plWordnet).
2016
Recognition of non-domain phrases in automatically extracted lists of terms
Agnieszka Mykowiecka | Malgorzata Marciniak | Piotr Rychlik
Proceedings of the 5th International Workshop on Computational Terminology (Computerm2016)
In the paper, we address the problem of recognizing non-domain phrases in terminology lists obtained with an automatic term extraction tool. We focus on the identification of multi-word phrases that are general terms and discourse function expressions. We tested several methods based on domain corpora comparison and a method based on contexts of phrases identified in a large corpus of general language. We compared the results of the methods to manual annotation. The results show that the task is quite hard, as the inter-annotator agreement is low. Several tested methods achieved similar overall results, although the phrase ordering varied between methods. The most successful method, with a precision of about 0.75 at the halfway point of the tested list, was the context-based method using a modified contextual diversity coefficient.
TermoPL - a Flexible Tool for Terminology Extraction
Malgorzata Marciniak | Agnieszka Mykowiecka | Piotr Rychlik
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
The purpose of this paper is to introduce the TermoPL tool created to extract terminology from domain corpora in Polish. The program extracts noun phrases, which serve as term candidates, with the help of a simple grammar that can be adapted to the user’s needs. It applies the C-value method to rank term candidates, which are either the longest identified nominal phrases or their nested subphrases. The method operates on simplified base forms in order to unify morphological variants of terms and to recognize their contexts. We support the recognition of nested terms with a word connection strength measure, which allows us to eliminate truncated phrases from the top part of the term list. The program has an option to convert simplified forms of phrases into correct phrases in the nominative case. TermoPL accepts as input morphologically annotated and disambiguated domain texts and creates a list of terms, the top part of which comprises domain terminology. It can also compare two candidate term lists using three different coefficients showing the asymmetry of term occurrences in the data.
2014
NPMI Driven Recognition of Nested Terms
Malgorzata Marciniak | Agnieszka Mykowiecka
Proceedings of the 4th International Workshop on Computational Terminology (Computerm)
2012
Combining Wordnet and Morphosyntactic Information in Terminology Clustering
Agnieszka Mykowiecka | Malgorzata Marciniak
Proceedings of COLING 2012
2011
Towards Morphologically Annotated Corpus of Hospital Discharge Reports in Polish
Malgorzata Marciniak | Agnieszka Mykowiecka
Proceedings of BioNLP 2011 Workshop
2010
Domain-related Annotation of Polish Spoken Dialogue Corpus LUNA.PL
Agnieszka Mykowiecka | Katarzyna Głowińska | Joanna Rabiega-Wiśniewska
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
The paper presents a corpus of Polish spoken dialogues annotated on several levels, from transcription of dialogues and their morphosyntactic analysis to semantic annotation. The LUNA.PL corpus is the first semantically annotated corpus of Polish spontaneous speech. It contains 500 dialogues recorded at the Warsaw Transport Authority call centre. For each dialogue, the corpus contains the recorded audio signal, its transcription and five XML files with annotations on subsequent levels. Speech transcription was done manually. Text annotation was constructed using a combination of rule-based programmes and computer-aided manual work. For morphological annotation we used an existing analyzer and manually disambiguated the results. Morphologically annotated texts of dialogues were automatically segmented into elementary syntactic chunks. Semantic annotation was done by a set of specially designed rules and then manually corrected. The paper describes details of the domain-related semantic annotation, which consists of two levels: the concept level, at which around 200 attributes and their values are annotated, and the predicate level, at which 47 frame types are recognized. We describe the adopted domain model and the statistics over the entire annotated set of dialogues.
2007
Information Extraction from Patients’ Free Form Documentation
Agnieszka Mykowiecka | Malgorzata Marciniak
Biological, translational, and clinical language processing
Automatic Processing of Diabetic Patients’ Hospital Documentation
Małgorzata Marciniak | Agnieszka Mykowiecka
Proceedings of the Workshop on Balto-Slavonic Natural Language Processing
2000
An HPSG-Annotated Test Suite for Polish
Malgorzata Marciniak | Agnieszka Mykowiecka | Anna Kupść | Adam Przepiórkowski
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)