A Parallel English - Serbian - Bulgarian - Macedonian Lexicon of Named Entities
Aleksandar Petrovski
Proceedings of the 5th International Conference on Computational Linguistics in Bulgaria (CLIB 2022)

This paper describes the creation of a parallel multilingual lexicon of named entities from English to three South Slavic languages (Serbian, Bulgarian and Macedonian), with Wikipedia as the source. The basics of the proposed methodology are well known. It offers an inexpensive way to build multilingual lexicons without requiring expertise in the target languages. Wikipedia’s database dumps can be freely downloaded in SQL and XML formats. The method has been implemented as a Python application that extracts the English – Serbian – Bulgarian – Macedonian parallel titles from Wikipedia and classifies them using the English Wikipedia category system. Using Wikipedia metadata, the extracted named entities have been classified into five classes: PERSON, ORGANIZATION, LOCATION, PRODUCT, and MISC (miscellaneous). The quality of the classification was checked manually on 1,000 randomly chosen named entities, yielding 97% precision and 90% recall.
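The extraction and classification steps might be sketched as follows. This is a minimal illustration under stated assumptions, not the application described in the paper. The keyword rules in `CATEGORY_RULES`, the function names, and the input shapes are all hypothetical. The sketch also assumes that interlanguage links have already been resolved to rows of the form (English title, language code, foreign title), and that an entry is kept only when all three target languages have a title.

```python
from collections import defaultdict

# Hypothetical keyword rules: the abstract does not list the actual rules,
# only that English Wikipedia categories drive the classification.
# Rules are checked in order, so earlier classes win on ambiguous pages.
CATEGORY_RULES = [
    ("PERSON", ("births", "deaths", "people")),
    ("ORGANIZATION", ("companies", "organizations", "universities")),
    ("LOCATION", ("cities", "countries", "rivers", "mountains")),
    ("PRODUCT", ("products", "software", "vehicles")),
]

# Wikipedia language codes for Serbian, Bulgarian and Macedonian.
TARGET_LANGS = ("sr", "bg", "mk")


def classify(categories):
    """Return the first class whose keyword occurs in a category name, else MISC."""
    lowered = [c.lower() for c in categories]
    for label, keywords in CATEGORY_RULES:
        if any(k in c for c in lowered for k in keywords):
            return label
    return "MISC"


def build_lexicon(langlinks, page_categories):
    """Align titles across languages and attach a class to each entry.

    langlinks: iterable of (english_title, lang_code, foreign_title) rows
               (assumed to be pre-extracted from the dump).
    page_categories: dict mapping an English title to its category names.
    """
    titles = defaultdict(dict)
    for en_title, lang, foreign_title in langlinks:
        if lang in TARGET_LANGS:
            titles[en_title][lang] = foreign_title

    lexicon = []
    for en_title, translations in titles.items():
        # Assumption: keep only entries with a title in every target language.
        if all(lang in translations for lang in TARGET_LANGS):
            entry = {"en": en_title, **translations}
            entry["class"] = classify(page_categories.get(en_title, []))
            lexicon.append(entry)
    return lexicon
```

Calling `build_lexicon` with a handful of link rows and a category dictionary returns one dictionary per fully aligned entity. Because the rules are order-dependent substring matches, a real system would likely need a more careful mapping from categories to classes.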