Tisa Islam Erana
2026
COGNAC at SemEval-2026 Task 5: LLM Ensembles for Human-Level Word Sense Plausibility Rating in Challenging Narratives
Azwad Anjum Islam | Tisa Islam Erana
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)
We present a system for SemEval-2026 Task 5 that predicts 1–5 plausibility ratings for candidate senses of homonyms in ambiguous short stories using prompting with closed-source LLMs. We evaluate three prompting strategies: zero-shot, chain-of-thought, and comparative prompting that jointly scores competing senses. We also find that simple unweighted ensembling aligns with subjective human judgments better than individual models do. Our official submission ranked 4th on the leaderboard with an average score of 0.86; post-competition experiments improved performance to 0.89.
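As an illustration of what unweighted ensembling can look like for this kind of rating task, the following is a minimal sketch, not the authors' implementation. The function name `ensemble_rating`, the model labels, and the clipping to the 1–5 range are assumptions made for the example; the abstract only states that the ensemble is simple and unweighted.

```python
from statistics import mean


def ensemble_rating(ratings_by_model: dict[str, float]) -> float:
    """Average per-model plausibility ratings with equal weight.

    `ratings_by_model` maps a (hypothetical) model label to the 1-5 rating
    that model produced for one candidate word sense.
    """
    if not ratings_by_model:
        raise ValueError("need at least one model rating")
    avg = float(mean(ratings_by_model.values()))
    # Clip to the task's 1-5 scale in case a model returned an out-of-range value.
    return min(5.0, max(1.0, avg))


# Hypothetical ratings from three models for the same candidate sense.
example = {"model_a": 4, "model_b": 5, "model_c": 3}
combined = ensemble_rating(example)
```

Equal weighting needs no tuning data, which is one plausible reason a simple average can be attractive when the human ratings themselves are subjective.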
COGNAC at SemEval-2026 Task 4: Evaluating Narrative Components with LLMs for Hard Story Similarity Cases
Tisa Islam Erana | Azwad Anjum Islam | Anshu Sharma | Mark Finlayson
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)
This paper presents a two-stage system for the SemEval-2026 shared task on narrative similarity. The task defines similarity in terms of three components: abstract theme, course of action, and outcome. For Track A, the system first applies majority voting over multiple independent large language model (LLM) judgments to handle high-agreement (easy) cases. For low-agreement (difficult) cases, it routes examples to a second stage that decomposes stories into theme, course of action, and outcome, and either (i) scores these components individually with learned weights or (ii) uses structured chain-of-thought prompting to compare stories along the three dimensions. This two-stage approach improves robustness on difficult examples and achieves first place with 0.78 test accuracy. For Track B, the system generates embeddings of full stories and of individual narrative components using several embedding models. Experiments show that embeddings derived from the course-of-action component alone yield the best performance, achieving 0.72 accuracy and ranking first. Additional analyses reveal substantial annotation variability in the dataset and highlight the importance of handling ambiguity and disagreement when modeling narrative similarity.
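The first-stage routing described in this abstract can be sketched roughly as follows. This is an assumed reconstruction for illustration only: the function name `route_by_agreement`, the labels "A" and "B", and the 0.8 agreement threshold are hypothetical and do not come from the paper.

```python
from collections import Counter


def route_by_agreement(votes: list[str], threshold: float = 0.8):
    """Majority-vote over independent LLM judgments and route by agreement.

    Returns (label, agreement, stage). High-agreement cases are answered in
    stage 1; low-agreement cases return no label and are deferred to stage 2,
    where a component-level comparison would run.
    """
    if not votes:
        raise ValueError("need at least one vote")
    label, top_count = Counter(votes).most_common(1)[0]
    agreement = top_count / len(votes)
    if agreement >= threshold:
        return label, agreement, "stage1"
    return None, agreement, "stage2"


# Hypothetical judgments: which of two candidate stories is closer to the anchor.
easy = route_by_agreement(["A", "A", "A", "A", "B"])
hard = route_by_agreement(["A", "B", "A", "B", "B"])
```

Deferring only the low-agreement cases keeps the more expensive component-level analysis for the examples where the models actually disagree.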
2025
COGNAC at CQs-Gen 2025: Generating Critical Questions with LLM-Assisted Prompting and Multiple RAG Variants
Azwad Anjum Islam | Tisa Islam Erana | Mark A. Finlayson
Proceedings of the 12th Argument mining Workshop
We describe three approaches to solving the Critical Questions Generation Shared Task at ArgMining 2025. The task objective is to automatically generate critical questions that challenge the strength, validity, and credibility of a given argumentative text. The task dataset comprises debate statements (“interventions”) annotated with a list of named argumentation schemes and associated with a set of critical questions (CQs). Our three Retrieval-Augmented Generation (RAG)-based approaches used in-context example selection based on (1) embedding the intervention, (2) embedding the intervention plus manually curated argumentation scheme descriptions as supplementary context, and (3) embedding the intervention plus a selection of associated CQs and argumentation scheme descriptions. We developed the prompt templates through GPT-4o-assisted analysis of patterns in validation data and the task-specific evaluation guideline. All three of our submitted systems outperformed the official baselines (0.44 and 0.53), with automatically computed accuracies of 0.62, 0.58, and 0.61, respectively, on the test data; our first method secured 2nd place in the competition (0.63 under manual evaluation). Our results highlight the efficacy of LLM-assisted prompt development and RAG-enhanced generation in crafting contextually relevant critical questions for argument analysis.
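Embedding-based in-context example selection, the common step across the three variants, might look like the sketch below. It is a simplified illustration under stated assumptions rather than the submitted system: the vectors are toy values standing in for real embeddings, and `cosine` and `select_examples` are hypothetical helper names. The three variants would differ only in which text is embedded to produce the query and pool vectors.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_examples(query_vec, pool, k=2):
    """Return the k pool items whose embeddings are closest to the query.

    `pool` is a list of (example_text, embedding) pairs; the selected texts
    would then be placed in the prompt as in-context examples.
    """
    ranked = sorted(pool, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy 2-dimensional "embeddings" for three stored interventions.
pool = [("ex1", [1.0, 0.0]), ("ex2", [0.0, 1.0]), ("ex3", [0.9, 0.1])]
chosen = select_examples([1.0, 0.0], pool, k=2)
```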