Athanasios Voulodimos

2026

CSE-UOI at SemEval-2026 Task 6: A Two-Stage Heterogeneous Ensemble with Deliberative Complexity Gating for Political Evasion Detection
Christos Tzouvaras | Konstantinos Skianis | Athanasios Voulodimos
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)

This paper describes our system for SemEval-2026 Task 6, which classifies the clarity of responses in political interviews into three categories: Clear Reply, Ambivalent, and Clear Non-Reply. We propose a heterogeneous dual large language model (LLM) ensemble via self-consistency (SC) and weighted voting, and a novel post-hoc correction mechanism, Deliberative Complexity Gating (DCG). This mechanism uses cross-model behavioral signals and exploits the finding that an LLM response-length proxy correlates strongly with sample ambiguity. To further examine mechanisms for improving ambiguity detection, we evaluated multi-agent debate as an alternative strategy for increasing deliberative capacity. Unlike DCG, which adaptively gates reasoning using cross-model behavioral signals, debate increases agent count without increasing model diversity. Our solution achieved a Macro-F1 score of 0.85 on the evaluation set, securing 3rd place and tying with the second-best reported score.

AILS-NTUA at SemEval-2026 Task 12: Graph-Based Retrieval and Reflective Prompting for Abductive Event Reasoning
Nikolaos Karafyllis | Maria Lymperaiou | Giorgos Filandrianos | Athanasios Voulodimos | Giorgos Stamou
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)

We present a winning three-stage system for SemEval-2026 Task 12: Abductive Event Reasoning that combines graph-based retrieval, LLM-driven abductive reasoning with prompt design informed by reflective prompt evolution, and post-hoc consistency enforcement; our system ranks first on the evaluation-phase leaderboard with an accuracy score of 0.95. Cross-model error analysis across 14 models (7 families) reveals three shared inductive biases: causal chain incompleteness, proximate cause preference, and salience bias, whose cross-family convergence (51% cause-count reduction) indicates systematic rather than model-specific failure modes in multi-label causal reasoning.

AILS-NTUA at SemEval-2026 Task 8: Evaluating Multi-Turn RAG Conversations
Dimosthenis Athanasiou | Maria Lymperaiou | Giorgos Filandrianos | Athanasios Voulodimos | Giorgos Stamou
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)

We describe the AILS-NTUA system for SemEval-2026 Task 8 (MTRAGEval), addressing all three subtasks of multi-turn retrieval-augmented generation: passage retrieval (A), reference-grounded response generation (B), and end-to-end RAG (C). Our approach is based on two main design principles. First, we adopt a query-diversity-over-retriever-diversity strategy, where multiple complementary LLM-based query reformulations are issued to a single corpus-aligned sparse retriever and combined using a variance-aware nested Reciprocal Rank Fusion scheme. Second, we employ an agentic generation pipeline that decomposes grounded response generation into evidence span extraction, dual-candidate drafting, and calibrated multi-judge selection. The proposed system achieves strong performance across subtasks, ranking first in Task A and second in Task B in the official evaluation. Our empirical findings indicate that query diversity over a well-aligned retriever is more effective than heterogeneous retriever ensembling, and that answerability calibration, rather than retrieval coverage, emerges as the primary bottleneck in end-to-end performance.
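
As a rough illustration of the fusion step named in this abstract, the sketch below implements standard Reciprocal Rank Fusion and a simple two-level ("nested") application of it. It is not the authors' implementation: the abstract does not specify the variance-aware weighting, so none is modeled here, and the function names, the constant `k = 60`, and the grouping of rankings are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[Hashable]], k: int = 60
) -> List[Hashable]:
    """Fuse ranked lists with standard RRF: score(d) = sum of 1 / (k + rank)."""
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by the string form of the id
    # so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], str(d)))


def nested_rrf(
    groups: Sequence[Sequence[Sequence[Hashable]]], k: int = 60
) -> List[Hashable]:
    """Hypothetical nesting: fuse within each group, then fuse the group results."""
    per_group = [reciprocal_rank_fusion(group, k) for group in groups]
    return reciprocal_rank_fusion(per_group, k)


# Three rankings, e.g. one per query reformulation sent to the same retriever.
fused = reciprocal_rank_fusion(
    [["a", "b", "c"], ["b", "a", "d"], ["b", "c", "a"]]
)
```

Here each inner list stands for the ranking returned for one query reformulation. A document ranked highly by several reformulations accumulates a larger fused score than one ranked highly by only a single reformulation.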

AILS-NTUA at SemEval-2026 Task 3: Efficient Dimensional Aspect-Based Sentiment Analysis
Stavros Gazetas | Giorgos Filandrianos | Maria Lymperaiou | Paraskevi Tzouveli | Athanasios Voulodimos | Giorgos Stamou
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)

In this paper, we present the AILS-NTUA system for Track-A of SemEval-2026 Task 3 on Dimensional Aspect-Based Sentiment Analysis (DimABSA), which encompasses three complementary problems: Dimensional Aspect Sentiment Regression (DimASR), Dimensional Aspect Sentiment Triplet Extraction (DimASTE), and Dimensional Aspect Sentiment Quadruplet Prediction (DimASQP) within a multilingual and multi-domain framework. Our methodology combines fine-tuning of language-appropriate encoder backbones for continuous aspect-level sentiment prediction with language-specific instruction tuning of large language models using LoRA for structured triplet and quadruplet extraction. This unified yet task-adaptive design emphasizes parameter-efficient specialization across languages and domains, enabling reduced training and inference requirements while maintaining strong effectiveness. Empirical results demonstrate that the proposed models achieve competitive performance and consistently surpass the provided baselines across most evaluation settings.

AILS-NTUA at SemEval-2026 Task 10: Agentic LLMs for Psycholinguistic Marker Extraction and Conspiracy Endorsement Detection
Panagiotis Spanakis | Maria Lymperaiou | Giorgos Filandrianos | Athanasios Voulodimos | Giorgos Stamou
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)

This paper presents a novel agentic LLM pipeline for SemEval-2026 Task 10 that jointly extracts psycholinguistic conspiracy markers and detects conspiracy endorsement. Unlike traditional classifiers that conflate semantic reasoning with structural localization, our decoupled design isolates and addresses these challenges separately. For marker extraction, we propose Dynamic Discriminative Chain-of-Thought (DD-CoT) with deterministic anchoring to resolve semantic ambiguity and character-level brittleness. For conspiracy detection, an “Anti-Echo Chamber” architecture, consisting of an adversarial Parallel Council adjudicated by a Calibrated Judge, overcomes the “Reporter Trap”, where models falsely penalize objective reporting. Our system achieves 0.24 Macro F1 (+100% over baseline) on S1 and 0.79 Macro F1 (+49%) on S2, ranking 3rd on the S1 development leaderboard and 8th on the test set, demonstrating that structured agentic deliberation is an effective alternative to fine-tuning for interpretable psycholinguistic NLP.

Exploration of Perceptual Speech Features for Clinical Decision-Support in Mental Health Care
Vassilis Lyberatos | Edmund Dervakos | Eleni Adamidi | Athanasios Voulodimos | Giorgos Stamou
Proceedings of the 10th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2026)

Speech and language technologies offer valuable opportunities for supporting mental health assessment through objective and interpretable cues. We present a systematic feature-based analysis framework leveraging perceptually grounded acoustic and linguistic characteristics, including prosody, vocal quality, semantic coherence, syntactic structure, and sarcasm. Using statistical analysis and interpretable machine learning (XGBoost with SHAP and LIME), we examine associations between speech features and validated symptom measures of depression, anxiety, and ADHD. Evaluated on both controlled benchmark datasets (StressID, DAIC-WOZ, Androids, EATD) and a real-world clinical dataset, the framework reveals stable and consistent relationships between symptom severity and vocal irregularities (e.g., shimmer, jitter), lexical–syntactic patterns, and affective tone. An ablation study conducted across all datasets further identifies the most informative feature groups. This work explores a transparent and clinically interpretable approach to speech-based mental health analysis.

2025

AILS-NTUA at SemEval-2025 Task 8: Language-to-Code prompting and Error Fixing for Tabular Question Answering
Andreas Evangelatos | George Filandrianos | Maria Lymperaiou | Athanasios Voulodimos | Giorgos Stamou
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

In this paper, we present our submission to SemEval-2025 Task 8: Question Answering over Tabular Data. This task, evaluated on the DataBench dataset, assesses Large Language Models’ (LLMs) ability to answer natural language questions over structured data while addressing topic diversity and table size limitations in previous benchmarks. We propose a system that employs effective LLM prompting to translate natural language queries into executable code, enabling accurate responses, error correction, and interpretability. Our approach ranks first in both subtasks of the competition in the proprietary model category, significantly outperforming the organizer’s baseline.

AILS-NTUA at SemEval-2025 Task 4: Parameter-Efficient Unlearning for Large Language Models using Data Chunking
Iraklis Premptis | Maria Lymperaiou | George Filandrianos | Orfeas Menis Mastromichalakis | Athanasios Voulodimos | Giorgos Stamou
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

The Unlearning Sensitive Content from Large Language Models task aims to remove targeted datapoints from trained models while minimally affecting their general knowledge. In our work, we leverage parameter-efficient, gradient-based unlearning using low-rank (LoRA) adaptation and layer-focused fine-tuning. To further enhance unlearning effectiveness, we employ data chunking, splitting forget data into disjoint partitions and merging them with cyclically sampled retain samples at a pre-defined ratio. Our task-agnostic method achieves an outstanding forget-retain balance, ranking first on leaderboards and significantly outperforming baselines and competing systems.
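
The data chunking step described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `build_chunks`, the contiguous partitioning, and the interpretation of the ratio as "retain samples per forget sample" are all hypothetical choices made for this example.

```python
from itertools import cycle, islice
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def build_chunks(
    forget: Sequence[T],
    retain: Sequence[T],
    num_chunks: int,
    retain_per_forget: int,
) -> List[Tuple[List[T], List[T]]]:
    """Pair disjoint forget partitions with cyclically sampled retain samples.

    Assumption: the "pre-defined ratio" is read here as the number of retain
    samples drawn for each forget sample in a chunk.
    """
    if num_chunks <= 0 or retain_per_forget < 0:
        raise ValueError("num_chunks must be positive and the ratio non-negative")

    n = len(forget)
    # Boundaries of contiguous, non-overlapping partitions of the forget set.
    bounds = [i * n // num_chunks for i in range(num_chunks + 1)]
    # The retain set is walked cyclically, continuing where the last chunk stopped.
    retain_stream = cycle(retain)

    chunks = []
    for lo, hi in zip(bounds, bounds[1:]):
        part = list(forget[lo:hi])
        if not part:
            continue  # more chunks requested than forget samples available
        sampled = list(islice(retain_stream, len(part) * retain_per_forget))
        chunks.append((part, sampled))
    return chunks


chunks = build_chunks(
    forget=["f0", "f1", "f2", "f3", "f4"],
    retain=["r0", "r1", "r2"],
    num_chunks=2,
    retain_per_forget=1,
)
```

Each resulting pair holds one forget partition and the retain samples merged with it. Because the retain set is cycled rather than partitioned, it may be smaller than the total number of retain samples requested.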

AILS-NTUA at SemEval-2025 Task 3: Leveraging Large Language Models and Translation Strategies for Multilingual Hallucination Detection
Dimitra Karkani | Maria Lymperaiou | George Filandrianos | Nikolaos Spanos | Athanasios Voulodimos | Giorgos Stamou
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

Multilingual hallucination detection stands as an underexplored challenge, which the Mu-SHROOM shared task seeks to address. In this work, we propose an efficient, training-free LLM prompting strategy that enhances detection by translating multilingual text spans into English. Our approach achieves competitive rankings across multiple languages, securing two first positions in low-resource languages. The consistency of our results highlights the effectiveness of our translation strategy for hallucination detection, demonstrating its applicability regardless of the source language.

Pitfalls of Scale: Investigating the Inverse Task of Redefinition in Large Language Models
Elena Stringli | Maria Lymperaiou | Giorgos Filandrianos | Athanasios Voulodimos | Giorgos Stamou
Findings of the Association for Computational Linguistics: ACL 2025

Inverse tasks can uncover potential reasoning gaps as Large Language Models (LLMs) scale up. In this work, we explore the redefinition task, in which we assign alternative values to well-known physical constants and units of measure, prompting LLMs to respond accordingly. Our findings show that not only does model performance degrade with scale, but its false confidence also rises. Moreover, while factors such as prompting strategies or response formatting are influential, they do not preclude LLMs from anchoring to memorized values.