Nazarii Drushchak

2026

Adapting large language models to low-resource languages presents three interconnected challenges: inefficient tokenization, scarcity of high-quality annotated data, and limited resources for instruction tuning. We present a reproducible approach that addresses each challenge using data-centric methods that primarily rely on unlabeled text corpora, parallel translation data, and a multilingual base model. Our approach combines (1) vocabulary surgery for tokenizer adaptation without full retraining, (2) cross-lingual transfer of quality classifiers via translation, enabling filtering without target-language annotations, and (3) generation of instruction data through translation, task conversion, and targeted synthesis. We validate this recipe by adapting Gemma-3-12B to Ukrainian. %, producing Lapa-12BOur pretrained model achieves top performance on Ukrainian benchmarks, while our instruction-tuned variant demonstrates strong performance on translation (33 BLEU on FLORES), summarization, and question-answering tasks, while requiring 1.5x fewer tokens than the original model for the same text. We release all models, datasets, classifiers, and code to enable replication for other languages.

2025

pdf bib abs

GBEM-UA: Gender Bias Evaluation and Mitigation for Ukrainian Large Language Models
Mykhailo Buleshnyi | Maksym Buleshnyi | Marta Sumyk | Nazarii Drushchak
Proceedings of the Fourth Ukrainian Natural Language Processing Workshop (UNLP 2025)

Large Language Models (LLMs) have demonstrated remarkable performance across various domains, but they often inherit biases present in the data they are trained on, leading to unfair or unreliable outcomes—particularly in sensitive areas such as hiring, medical decision-making, and education. This paper evaluates gender bias in LLMs within the Ukrainian language context, where the gendered nature of the language and the use of feminitives introduce additional complexity to bias analysis. We propose a benchmark for measuring bias in Ukrainian and assess several debiasing methods, including prompt debiasing, embedding debiasing, and fine-tuning, to evaluate their effectiveness. Our results suggest that embedding debiasing alone is insufficient for a morphologically rich language like Ukrainian, whereas fine-tuning proves more effective in mitigating bias for domain-specific tasks.

pdf bib abs

UAlign: LLM Alignment Benchmark for the Ukrainian Language
Andrian Kravchenko | Yurii Paniv | Nazarii Drushchak
Proceedings of the Fourth Ukrainian Natural Language Processing Workshop (UNLP 2025)

This paper introduces UAlign, the comprehensive benchmark for evaluating the alignment of Large Language Models (LLMs) in the Ukrainian language. The benchmark consists of two complementary components: a moral judgment dataset with 3,682 scenarios of varying ethical complexities and a dataset with 1,700 ethical situations presenting clear normative distinctions. Each element provides parallel English-Ukrainian text pairs, enabling cross-lingual comparison. Unlike existing resources predominantly developed for high-resource languages, our benchmark addresses the critical need for evaluation resources in Ukrainian. The development process involved machine translation and linguistic validation using Ukrainian language models for grammatical error correction. Our cross-lingual evaluation of six LLMs confirmed the existence of a performance gap between alignment in Ukrainian and English while simultaneously providing valuable insights regarding the overall alignment capabilities of these models. The benchmark has been made publicly available to facilitate further research initiatives and enhance commercial applications.Warning: The datasets introduced in this paper contain sensitive materials related to ethical and moral scenarios that may include offensive, harmful, illegal, or controversial content.

pdf bib abs

Improving Named Entity Recognition for Low-Resource Languages Using Large Language Models: A Ukrainian Case Study
Vladyslav Radchenko | Nazarii Drushchak
Proceedings of the Fourth Ukrainian Natural Language Processing Workshop (UNLP 2025)

Named Entity Recognition (NER) is a fundamental task in Natural Language Processing (NLP), yet achieving high performance for low-resource languages remains challenging due to limited annotated data and linguistic complexity. Ukrainian exemplifies these issues with its rich morphology and scarce NLP resources. Recent advances in Large Language Models (LLMs) demonstrate their ability to generalize across diverse languages and domains, offering promising solutions without extensive annotations. This research explores adapting state-of-the-art LLMs to Ukrainian through prompt engineering, including chain-of-thought (CoT) strategies, and model refinement via Supervised Fine-Tuning (SFT). Our best model achieves 0.89 F1 on the NER-UK 2.0 benchmark, matching the performance of advanced encoder-only baselines. These findings highlight practical pathways for improving NER in low-resource contexts, promoting more accessible and scalable language technologies.

pdf bib abs

Multimodal Retrieval-Augmented Generation: Unified Information Processing Across Text, Image, Table, and Video Modalities
Nazarii Drushchak | Nataliya Polyakovska | Maryna Bautina | Taras Semenchenko | Jakub Koscielecki | Wojciech Sykala | Michal Wegrzynowski
Proceedings of the 1st Workshop on Multimodal Augmented Generation via Multimodal Retrieval (MAGMaR 2025)

Retrieval-augmented generation (RAG) is a powerful paradigm for leveraging external data to enhance the capabilities of large language models (LLMs). However, most existing RAG solutions are tailored for single-modality or limited multimodal scenarios, restricting their applicability in real-world contexts where diverse data sources—including text, tables, images, and videos—must be integrated seamlessly. In this work proposes a unified Multimodal Retrieval-augmented generation (mRAG) system designed to unify information processing across all four modalities. Our pipeline ingests and indexes data from PDFs and videos using tools like Amazon Textract, Transcribe, Langfuse, and multimodal LLMs (e.g., Claude 3.5 Sonnet) for structured extraction and semantic enrichment. The dataset includes text queries, table lookups, image-based questions, and videos. Evaluation with the Deepeval framework shows improved retrieval accuracy and response quality, especially for structured text and tables. While performance on image and video queries is lower, the multimodal integration framework remains robust, underscoring the value of unified pipelines for diverse data.

2024

pdf bib abs

Introducing the Djinni Recruitment Dataset: A Corpus of Anonymized CVs and Job Postings
Nazarii Drushchak | Mariana Romanyshyn
Proceedings of the Third Ukrainian Natural Language Processing Workshop (UNLP) @ LREC-COLING 2024

This paper introduces the Djinni Recruitment Dataset, a large-scale open-source corpus of candidate profiles and job descriptions. With over 150,000 jobs and 230,000 candidates, the dataset includes samples in English and Ukrainian, thereby facilitating advancements in the recruitment domain of natural language processing (NLP) for both languages. It is one of the first open-source corpora in the recruitment domain, opening up new opportunities for AI-driven recruitment technologies and related fields. Notably, the dataset is accessible under the MIT license, encouraging widespread adoption for both scientific research and commercial projects.