Tomas Krilavičius

2025

pdf bib abs
Exploring Hate Speech Detection Models for Lithuanian Language
Justina Mandravickaitė | Eglė Rimkienė | Mindaugas Petkevičius | Milita Songailaitė | Eimantas Zaranka | Tomas Krilavičius
Proceedings of the The 9th Workshop on Online Abuse and Harms (WOAH)

Online hate speech poses a significant challenge, as it can incite violence and contribute to social polarization. This study evaluates traditional machine learning, deep learning and large language models (LLMs) for Lithuanian hate speech detection, addressing class imbalance issue via data augmentation and resampling techniques. Our dataset included 27,358 user-generated comments, annotated into Neutral language (56%), Offensive language (29%) and Hate speech (15%). We trained BiLSTM, LSTM, CNN, SVM, and Random Forest models and fine-tuned Multilingual BERT, LitLat BERT, Electra, RWKV, ChatGPT, LT-Llama-2, and Gemma-2 models. Additionally, we pre-trained Electra for Lithuanian. Models were evaluated using accuracy and weighted F1-score. On the imbalanced dataset, LitLat BERT (0.76 weighted F1-score) and Multilingual BERT (0.73 weighted F1-score) performed best. Over-sampling further boosted weighted F1-scores, with Multilingual BERT (0.85) and LitLat BERT (0.84) outperforming other models. Over-sampling combined with augmentation provided the best overall results. Under-sampling led to performance declines and was less effective. Finally, fine-tuning LLMs improved their accuracy which highlighted the importance of fine-tuning for more specialized NLP tasks.

2017

pdf bib abs
Stylometric Analysis of Parliamentary Speeches: Gender Dimension
Justina Mandravickaitė | Tomas Krilavičius
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing

Relation between gender and language has been studied by many authors, however, there is still some uncertainty left regarding gender influence on language usage in the professional environment. Often, the studied data sets are too small or texts of individual authors are too short in order to capture differences of language usage wrt gender successfully. This study draws from a larger corpus of speeches transcripts of the Lithuanian Parliament (1990-2013) to explore language differences of political debates by gender via stylometric analysis. Experimental set up consists of stylistic features that indicate lexical style and do not require external linguistic tools, namely the most frequent words, in combination with unsupervised machine learning algorithms. Results show that gender differences in the language use remain in professional environment not only in usage of function words, preferred linguistic constructions, but in the presented topics as well.

pdf bib abs
Identification of Multiword Expressions for Latvian and Lithuanian: Hybrid Approach
Justina Mandravickaitė | Tomas Krilavičius
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)

We discuss an experiment on automatic identification of bi-gram multi-word expressions in parallel Latvian and Lithuanian corpora. Raw corpora, lexical association measures (LAMs) and supervised machine learning (ML) are used due to deficit and quality of lexical resources (e.g., POS-tagger, parser) and tools. While combining LAMs with ML is rather effective for other languages, it has shown some nice results for Lithuanian and Latvian as well. Combining LAMs with ML we have achieved 92,4% precision and 52,2% recall for Latvian and 95,1% precision and 77,8% recall for Lithuanian.

2016

pdf bib abs
NLP Infrastructure for the Lithuanian Language
Daiva Vitkutė-Adžgauskienė | Andrius Utka | Darius Amilevičius | Tomas Krilavičius
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The Information System for Syntactic and Semantic Analysis of the Lithuanian language (lith. Lietuvių kalbos sintaksinės ir semantinės analizės informacinė sistema, LKSSAIS) is the first infrastructure for the Lithuanian language combining Lithuanian language tools and resources for diverse linguistic research and applications tasks. It provides access to the basic as well as advanced natural language processing tools and resources, including tools for corpus creation and management, text preprocessing and annotation, ontology building, named entity recognition, morphosyntactic and semantic analysis, sentiment analysis, etc. It is an important platform for researchers and developers in the field of natural language technology.