Daniel Skala
2026
RAGthoven at SemEval-2026 Task 1: A Multi-Stage Pipeline Walks Into a Benchmark and Barely Clears the Bar
Marek Suppa | Viktória Ondrejová | Lucia Ganajová | Gregor Karetka | Daniel Skala
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)
Marek Suppa | Viktória Ondrejová | Lucia Ganajová | Gregor Karetka | Daniel Skala
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)
We present \textsc{RAGthoven}, our system for SemEval-2026 Task~1 (MuWaHaHa), Subtask~A (multilingual constrained humor generation in English, Spanish, and Chinese).\textsc{RAGthoven} decomposes creative text generation into a multi-stage large language model (LLM) pipeline (\textit{Planner}, \textit{Writer}, \textit{Reflector}, \textit{Judge}) grounded in computational humor theories (Benign Violation Theory, Script-based Semantic Theory of Humor) and iteratively refined through prompt engineering across ten experiments.In our final configuration, we augment the Planner with retrieval-augmented generation (RAG) from a curated joke corpus, seeding generation with diverse joke mechanisms.We additionally explore an agentic variant that exposes the same four pipeline stages as tool-calling agents orchestrated by a model loop with a \textsc{ConstraintAudit} checker. While it achieves full constraint compliance, human pairwise evaluation did not reveal a significant quality advantage over the simpler non-agentic baseline.\textsc{RAGthoven} achieves Rank~1 in all three languages, with the strongest result in Spanish (Elo 1182, 42 points above the Gemini~2.5~Flash baseline).However, while the system leads in raw Elo in Spanish, it shares Rank~1 with the baseline in all three languages due to overlapping confidence intervals; in English and Chinese the gap narrows further, suggesting that elaborate multi-stage prompt engineering may offer diminishing returns once a strong frontier model is in the loop.
2024
Bryndza at ClimateActivism 2024: Stance, Target and Hate Event Detection via Retrieval-Augmented GPT-4 and LLaMA
Marek Suppa | Daniel Skala | Daniela Jass | Samuel Sucik | Andrej Svec | Peter Hraska
Proceedings of the 7th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2024)
Marek Suppa | Daniel Skala | Daniela Jass | Samuel Sucik | Andrej Svec | Peter Hraska
Proceedings of the 7th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2024)
This study details our approach for the CASE 2024 Shared Task on Climate Activism Stance and Hate Event Detection, focusing on Hate Speech Detection, Hate Speech Target Identification, and Stance Detection as classification challenges. We explored the capability of Large Language Models (LLMs), particularly GPT-4, in zero- or few-shot settings enhanced by retrieval augmentation and re-ranking for Tweet classification. Our goal was to determine if LLMs could match or surpass traditional methods in this context. We conducted an ablation study with LLaMA for comparison, and our results indicate that our models significantly outperformed the baselines, securing second place in the Target Detection task. The code for our submission is available at https://github.com/NaiveNeuron/bryndza-case-2024
2023
Prompterator: Iterate Efficiently towards More Effective Prompts
Samuel Sučik | Daniel Skala | Andrej Švec | Peter Hraška | Marek Šuppa
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Samuel Sučik | Daniel Skala | Andrej Švec | Peter Hraška | Marek Šuppa
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
With the advent of Large Language Models (LLMs) the process known as prompting, which entices the LLM to solve an arbitrary language processing task without the need for finetuning, has risen to prominence. Finding well-performing prompts, however, is a non-trivial task which requires experimentation in order to arrive at a prompt that solves a specific task. When a given task does not readily reduce to one that can be easily measured with well established metrics, human evaluation of the results obtained by prompting is often necessary. In this work we present prompterator, a tool that helps the user interactively iterate over various potential prompts and choose the best performing one based on human feedback. It is distributed as an open source package with out-of-the-box support for various LLM providers and was designed to be easily extensible.