2023
TermoUD - a language-independent terminology extraction tool
Malgorzata Marciniak | Piotr Rychlik | Agnieszka Mykowiecka
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations
The paper presents TermoUD, a language-independent terminology extraction tool. Its previous version, TermoPL (Marciniak et al., 2016; Rychlik et al., 2022), uses a language-dependent shallow grammar to select candidate terms. The goal behind the development of TermoUD is to make the procedure as universal as possible while preserving the linguistic correctness of the selected phrases. The tool is suitable for any language for which a Universal Dependencies (UD) parser exists. We describe a method of candidate term extraction based on UD POS tags and UD relations. Candidates are ranked with the C-value metric (with context counting adapted to the UD formalism), which does not need any additional language resources. The performance of the tool has been tested on texts in English, French, Dutch, and Slovenian. The results are evaluated on the manually annotated datasets ACTER, RD-TEC 2.0, GENIA, and RSDO5, and compared to those obtained by other tools.
2020
Supporting terminology extraction with dependency parses
Malgorzata Marciniak | Piotr Rychlik | Agnieszka Mykowiecka
Proceedings of the 6th International Workshop on Computational Terminology
A terminology extraction procedure usually consists of selecting candidates for terms and ordering them according to their importance for the given text or set of texts. Depending on the method used, the list of candidates contains different fractions of grammatically incorrect, semantically odd, and irrelevant sequences. The aim of this work was to improve term candidate selection by using a dependency parser for Polish to reduce the number of incorrect sequences.
2019
Estimating senses with sets of lexically related words for Polish word sense disambiguation
Szymon Rutkowski | Piotr Rychlik | Agnieszka Mykowiecka
Proceedings of the 10th Global Wordnet Conference
We propose a new algorithm for word sense disambiguation, exploiting data from a WordNet with many types of lexical relations, such as plWordNet for Polish. In this method, sense probabilities in context are approximated with a language model. To estimate the likelihood of a sense appearing within the word sequence, the token being disambiguated is replaced with words lexically related to the given sense or with words appearing in its WordNet gloss. We test this approach on a set of sense-annotated Polish sentences with a number of neural language models. Our best setup achieves an accuracy of 55.12% (72.02% when first senses are excluded), up from 51.77% for an existing PageRank-based method. While this does not exceed the first-sense (often the most frequent sense) baseline in the standard case, it encourages further research on combining WordNet data with neural models.
2018
SimLex-999 for Polish
Agnieszka Mykowiecka | Małgorzata Marciniak | Piotr Rychlik
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2016
Recognition of non-domain phrases in automatically extracted lists of terms
Agnieszka Mykowiecka | Malgorzata Marciniak | Piotr Rychlik
Proceedings of the 5th International Workshop on Computational Terminology (Computerm2016)
In this paper, we address the problem of recognizing non-domain phrases in terminology lists obtained with an automatic term extraction tool. We focus on identifying multi-word phrases that are general terms or discourse function expressions. We tested several methods based on domain corpora comparison and a method based on the contexts of phrases identified in a large corpus of general language. We compared the results of these methods to manual annotation. The results show that the task is quite hard, as the inter-annotator agreement is low. Several of the tested methods achieved similar overall results, although the phrase ordering varied between methods. The most successful method, with a precision of about 0.75 at the halfway point of the tested list, was the context-based method using a modified contextual diversity coefficient.
TermoPL - a Flexible Tool for Terminology Extraction
Malgorzata Marciniak | Agnieszka Mykowiecka | Piotr Rychlik
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
The purpose of this paper is to introduce the TermoPL tool, created to extract terminology from domain corpora in Polish. The program extracts noun phrases as term candidates with the help of a simple grammar that can be adapted to the user's needs. It applies the C-value method to rank term candidates, which are either the longest identified nominal phrases or their nested subphrases. The method operates on simplified base forms in order to unify morphological variants of terms and to recognize their contexts. We support the recognition of nested terms with word connection strength, which allows us to eliminate truncated phrases from the top part of the term list. The program has an option to convert simplified forms of phrases into correct phrases in the nominative case. TermoPL accepts morphologically annotated and disambiguated domain texts as input and creates a list of terms, the top part of which comprises domain terminology. It can also compare two candidate term lists using three different coefficients that show the asymmetry of term occurrences in these data.
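For readers unfamiliar with the C-value ranking mentioned in the abstracts above, the following is a minimal sketch of the classic formulation (Frantzi et al., 2000), not TermoPL's or TermoUD's actual implementation. The function names, the tuple-of-tokens representation, and the toy frequencies are assumptions made for illustration. A candidate that never occurs inside a longer candidate scores log2(length) × frequency; a nested candidate has the mean frequency of its containing candidates subtracted first.

```python
import math


def is_nested(short, long):
    """Return True if token tuple `short` occurs contiguously inside the longer `long`."""
    n, m = len(short), len(long)
    if n >= m:
        return False
    return any(long[i:i + n] == short for i in range(m - n + 1))


def c_value(freq):
    """Classic C-value for each candidate.

    `freq` maps a candidate (tuple of tokens) to its total corpus frequency,
    including occurrences nested inside longer candidates.
    """
    scores = {}
    for a, f_a in freq.items():
        # Longer candidates that contain `a` as a contiguous subsequence.
        containers = [b for b in freq if is_nested(a, b)]
        weight = math.log2(len(a))
        if not containers:
            scores[a] = weight * f_a
        else:
            mean_container_freq = sum(freq[b] for b in containers) / len(containers)
            scores[a] = weight * (f_a - mean_container_freq)
    return scores


# Toy frequencies, invented for illustration only.
freq = {
    ("soft", "contact", "lens"): 3,
    ("contact", "lens"): 5,
    ("lens",): 9,
}
scores = c_value(freq)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(" ".join(term), round(score, 3))
```

In this classic form every single-word candidate scores zero because log2(1) = 0, so tools that rank one-word terms have to modify the length weight; how TermoPL and TermoUD do this is not stated in the abstracts.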