2016
Rule-based Automatic Multi-word Term Extraction and Lemmatization
Ranka Stanković | Cvetana Krstev | Ivan Obradović | Biljana Lazić | Aleksandra Trtovac
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
In this paper we present a rule-based method for multi-word term extraction that relies on extensive lexical resources in the form of electronic dictionaries and finite-state transducers for modelling various syntactic structures of multi-word terms. The same technology is used for lemmatization of extracted multi-word terms, which is unavoidable for highly inflected languages in order to pass extracted data to evaluators and subsequently to terminological e-dictionaries and databases. The approach is illustrated on a corpus of Serbian texts from the mining domain containing more than 600,000 simple word forms. Extracted and lemmatized multi-word terms are filtered in order to reject falsely offered lemmas and then ranked by introducing measures that combine linguistic and statistical information (C-Value, T-Score, LLR, and Keyness). Mean average precision for retrieval of MWU forms ranges from 0.789 to 0.804, while mean average precision of lemma production ranges from 0.956 to 0.960. The evaluation showed that 94% of distinct multi-word forms were evaluated as proper multi-word units, and among them 97% were associated with correct lemmas.
2012
A tool for enhanced search of multilingual digital libraries of e-journals
Ranka Stanković | Cvetana Krstev | Ivan Obradović | Aleksandra Trtovac | Miloš Utvić
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
This paper outlines the main features of Bibliša, a tool that offers various possibilities of enhancing queries submitted to large collections of TMX documents generated from aligned parallel articles residing in multilingual digital libraries of e-journals. The queries initiated by a simple or multiword keyword, in Serbian or English, can be expanded by Bibliša, both semantically and morphologically, using different supporting monolingual and multilingual resources, such as wordnets and electronic dictionaries. The tool operates within a complex system composed of several modules including a web application, which makes it readily accessible on the web. Its functionality has been tested on a collection of 44 TMX documents generated from articles published bilingually by the journal INFOtheca, yielding encouraging results. Further enhancements of the tool are underway, with the aim of transforming it from a powerful full-text and metadata search tool to a useful translator's aid, which could be of assistance both in reviewing terminology used in context and in refining the multilingual resources used within the system.
2011
E-Dictionaries and Finite-State Automata for the Recognition of Named Entities
Cvetana Krstev | Duško Vitas | Ivan Obradović | Miloš Utvić
Proceedings of the 9th International Workshop on Finite State Methods and Natural Language Processing
2010
GIS Application Improvement with Multilingual Lexical and Terminological Resources
Ranka Stanković | Ivan Obradović | Olivera Kitanović
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
This paper introduces the results of integration of lexical and terminological resources, most of them developed within the Human Language Technology (HLT) Group at the University of Belgrade, with the Geological information system of Serbia (GeolISS) developed at the Faculty of Mining and Geology and funded by the Ministry of Environmental Protection. The approach to GeolISS development, which is aimed at the integration of existing geologic archives, data from published maps on different scales, newly acquired field data, and intranet and internet publishing of geologic information, is given, followed by the description of the geologic multilingual vocabulary and other lexical and terminological resources used. Two basic results are outlined: multilingual map annotation and improvement of queries for the GeolISS geodatabase. Multilingual labelling and annotation of maps for their graphic display and printing have been tested with Serbian, which describes regional information in the local language, and English, used for sharing geographic information with the world, although the geological vocabulary offers the possibility for integration of other languages as well. The resources also enable semantic and morphological expansion of queries, the latter being very important in highly inflective languages, such as Serbian.
2008
The Usage of Various Lexical Resources and Tools to Improve the Performance of Web Search Engines
Cvetana Krstev | Ranka Stanković | Duško Vitas | Ivan Obradović
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
In this paper we present how resources and tools developed within the Human Language Technology Group at the University of Belgrade can be used for tuning queries before submitting them to a web search engine. We argue that the selection of words chosen for a query, which are of paramount importance for the quality of results obtained by the query, can be substantially improved by using various lexical resources, such as morphological dictionaries and wordnets. These dictionaries enable semantic and morphological expansion of the query, the latter being very important in highly inflective languages, such as Serbian. Wordnets can also be used for adding another language to a query, if appropriate, thus making the query bilingual. Problems encountered in retrieving documents of interest are discussed and illustrated by examples. A brief description of resources is given, followed by an outline of the web tool which enables their integration. Finally, a set of examples is chosen in order to illustrate the use of the lexical resources and tool in question. Results obtained for these examples show that the number of documents obtained through a query by using our approach can double and even quadruple in some cases.
2006
WS4LR: A Workstation for Lexical Resources
Cvetana Krstev | Ranka Stanković | Duško Vitas | Ivan Obradović
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
In this paper we describe WS4LR, the workstation for lexical resources, a software tool developed within the Human Language Technology Group at the Faculty of Mathematics, University of Belgrade. The tool is aimed at manipulating heterogeneous lexical resources, and the need for such a tool came from the large volume of resources the Group has developed in the course of many years and within different projects. The tool handles morphological dictionaries, wordnets, aligned texts and transducers equally and has already proved very useful for various tasks. Although it has so far been used mainly for Serbian, WS4LR is not language dependent and can be successfully used for resources in other languages provided that they follow the described formats and methodologies. The tool operates on the .NET platform and runs on a personal computer under the Windows 2000/XP/2003 operating systems with at least 256MB of internal memory.
2004
Combining Heterogeneous Lexical Resources
Cvetana Krstev | Duško Vitas | Ranka Stanković | Ivan Obradović | Gordana Pavlović-Lažetić
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)