Samuel Harvan
2025
When the Dictionary Strikes Back: A Case Study on Slovak Migration Location Term Extraction and NER via Rule-Based vs. LLM Methods
Miroslav Blšták | Jaroslav Kopčan | Marek Šuppa | Samuel Harvan | Andrej Findor | Martin Takáč | Marián Šimko
Proceedings of the 10th Workshop on Slavic Natural Language Processing (Slavic NLP 2025)
This study explores the task of automatically extracting migration-related locations (source and destination) from media articles, focusing on the challenges posed by Slovak, a low-resource and morphologically complex language. We present the first comparative analysis of rule-based dictionary approaches (NLP4SK) versus Large Language Models (LLMs, e.g. SlovakBERT, GPT-4o) for both geographical relevance classification (Slovakia-focused migration) and specific source/target location extraction. To facilitate this research and future work, we introduce the first manually annotated Slovak dataset tailored for migration-focused locality detection. Our results show that while a fine-tuned SlovakBERT model achieves high accuracy for classification, specialized rule-based methods still have the potential to outperform LLMs for specific extraction tasks, though improved LLM performance with few-shot examples suggests future competitiveness as research in this area continues to evolve.
The Brittle Compass: Navigating LLM Prompt Sensitivity in Slovak Migration Media Discourse
Jaroslav Kopčan | Samuel Harvan | Marek Šuppa
Proceedings of the First Workshop on Advancing NLP for Low-Resource Languages
In this work, we present a case study that explores various tasks centered around the topic of migration in Slovak, a low-resource language, such as topic relevance and geographical relevance classification, and migration source/destination location term extraction. Our results demonstrate that native (Slovak) prompts yield a modest, task-dependent gain, while large models show significant robustness to prompt variations compared to their smaller counterparts. Analysis reveals that instructions (system or task) emerge as the most critical prompt component, more so than the example sections, with task-specific performance benefits being more pronounced than overall language effects.
Co-authors
- Jaroslav Kopčan 2
- Marek Šuppa 2
- Miroslav Blšták 1
- Andrej Findor 1
- Martin Takáč 1