Iglika Nikolova-Stoupak


2022

Filtering of Noisy Web-Crawled Parallel Corpus: the Japanese-Bulgarian Language Pair
Iglika Nikolova-Stoupak | Shuichiro Shimizu | Chenhui Chu | Sadao Kurohashi
Proceedings of the 5th International Conference on Computational Linguistics in Bulgaria (CLIB 2022)

One of the main challenges within the rapidly developing field of neural machine translation is its application to low-resource languages. Recent attempts to provide large parallel corpora in rare language pairs include the generation of web-crawled corpora, which may be vast but are, unfortunately, excessively noisy. The corpus utilised to train machine translation models in this study is CCMatrix, provided by OPUS. Firstly, the corpus is cleaned based on a number of heuristic rules. Then, parts of it are selected in three distinct ways: at random, based on the “margin distance” metric that is native to the CCMatrix dataset, and based on scores derived through the application of a state-of-the-art classifier model (Acarcicek et al., 2020) utilised in a thematic WMT shared task. The performance of the resulting models is evaluated and compared. The classifier-based model does not perform as well as its margin-based counterpart, opening a discussion of ways for further improvement. Still, BLEU scores surpass those of Acarcicek et al.’s (2020) paper by over 15 points.

2020

A Natural Language for Bulgarian Primary and Secondary Education
Iglika Nikolova-Stoupak
Proceedings of the 4th International Conference on Computational Linguistics in Bulgaria (CLIB 2020)

This paper examines the qualities and applicability of a provisional programming language, specially designed for use by beginner-level students in Bulgarian primary and secondary schools. The necessity for such a language is investigated. Then, relevant features are defined, as inspired by various programming languages (notably, languages used in education and characterised by non-English syntax) and by general trends related to the achievement of natural language in software development. A survey is conducted to test young students’ interaction with the language, and the latter’s advantages and limitations are listed and discussed.