Azwad Anjum Islam


2026

We present a system for SemEval-2026 Task 5 that predicts 1–5 plausibility ratings for candidate senses of homonyms in ambiguous short stories using prompting with closed-source LLMs. We evaluate three prompting strategies: zero-shot, chain-of-thought, and comparative prompting that jointly scores competing senses. We also find that simple unweighted ensembling aligns better with subjective human judgments than individual models do. Our official submission ranked 4th on the leaderboard with an average score of 0.86, and post-competition experiments improved performance to 0.89.
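As an illustration of the unweighted ensembling mentioned in the abstract above, the following is a minimal sketch that assumes each model returns a single rating on the 1–5 scale for a candidate sense. The function name `ensemble_rating` and the example ratings are hypothetical and are not taken from the paper.

```python
from statistics import mean


def ensemble_rating(model_ratings):
    """Return the unweighted mean of per-model plausibility ratings.

    Assumption: every rating lies on the 1-5 scale described in the abstract.
    """
    if not model_ratings:
        raise ValueError("at least one model rating is required")
    for rating in model_ratings:
        if not 1 <= rating <= 5:
            raise ValueError(f"rating out of range: {rating}")
    # Every model contributes equally; no weights are learned or tuned.
    return mean(model_ratings)


# Hypothetical ratings from three models for one candidate sense.
ratings = [4, 5, 3]
combined = ensemble_rating(ratings)
```

Averaging without weights keeps the combination free of tunable parameters, which is one plausible reason such an ensemble can track subjective human ratings well.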
This paper presents a two-stage system for the SemEval-2026 shared task on narrative similarity. The task defines similarity in terms of three components: abstract theme, course of action, and outcome. For Track A, the system first applies majority voting over multiple independent large language model (LLM) judgments to handle high-agreement (easy) cases. For low-agreement (difficult) cases, it routes examples to a second stage that decomposes stories into theme, course of action, and outcome, and either (i) scores these components individually with learned weights or (ii) uses structured chain-of-thought prompting to compare stories along the three dimensions. This two-stage approach improves robustness on difficult examples and achieves first place with 0.78 test accuracy. For Track B, the system generates embeddings of full stories and of individual narrative components using several embedding models. Experiments show that embeddings derived from the course-of-action component alone yield the best performance, achieving 0.72 accuracy and ranking first. Additional analyses reveal substantial annotation variability in the dataset and highlight the importance of handling ambiguity and disagreement when modeling narrative similarity.
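The two-stage routing for Track A described above could be sketched roughly as follows. This is an illustrative sketch only: the agreement threshold of 0.8, the function name `route_by_agreement`, and the `second_stage` callable (a stand-in for the component-wise comparison) are assumptions, not details from the paper.

```python
from collections import Counter


def route_by_agreement(votes, second_stage, threshold=0.8):
    """Resolve an example by majority vote, or defer to a second stage.

    votes: labels from several independent LLM judgments.
    second_stage: zero-argument callable, hypothetical stand-in for the
        decomposition into theme, course of action, and outcome.
    threshold: assumed minimum share of votes for the majority label.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    label, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    if agreement >= threshold:
        # High-agreement (easy) case: accept the majority label directly.
        return label, "stage1"
    # Low-agreement (difficult) case: hand over to the second stage.
    return second_stage(), "stage2"


easy = route_by_agreement(["A", "A", "A", "A", "B"], second_stage=lambda: "B")
hard = route_by_agreement(["A", "B", "A", "B", "A"], second_stage=lambda: "B")
```

In this sketch, `easy` is resolved in the first stage because four of five votes agree, while `hard` falls below the assumed threshold and is passed to the second stage.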

2025

We present our approach to solving the Narrative Classification portion of the Multilingual Characterization and Extraction of Narratives SemEval-2025 challenge (Task 10, Subtask 2). This task is a multi-label, multi-class document classification task, where the classes were defined via natural language titles, descriptions, short examples, and annotator instructions, with only a few (and sometimes no) labeled examples for training. Our approach leverages text summarization, binary relevance with zero-shot prompts, and hierarchical prompting using Large Language Models (LLMs) to identify the narratives and subnarratives in the provided news articles. Notably, we did not use the labeled examples to train the system. Our approach comfortably outperforms the official baseline, achieving F1 scores of 0.55 (narratives) and 0.43 (subnarratives), and placed 2nd on the test-set leaderboard at the system submission deadline. We provide an in-depth analysis of the construction and effectiveness of our approach using both open-source (LLaMA 3.1-8B-Instruct) and proprietary (GPT 4o-mini) Large Language Models under different prompting setups.
We describe three approaches to solving the Critical Questions Generation Shared Task at ArgMining 2025. The task objective is to automatically generate critical questions that challenge the strength, validity, and credibility of a given argumentative text. The task dataset comprises debate statements (“interventions”) annotated with a list of named argumentation schemes and associated with a set of critical questions (CQs). Our three Retrieval-Augmented Generation (RAG)-based approaches used in-context example selection based on (1) embedding the intervention, (2) embedding the intervention plus manually curated argumentation scheme descriptions as supplementary context, and (3) embedding the intervention plus a selection of associated CQs and argumentation scheme descriptions. We developed the prompt templates through GPT-4o-assisted analysis of patterns in validation data and the task-specific evaluation guideline. All three of our submitted systems outperformed the official baselines (0.44 and 0.53) with automatically computed accuracies of 0.62, 0.58, and 0.61, respectively, on the test data, with our first method securing 2nd place in the competition (0.63 manual evaluation). Our results highlight the efficacy of LLM-assisted prompt development and RAG-enhanced generation in crafting contextually relevant critical questions for argument analysis.
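The embedding-based in-context example selection described above might look roughly like the sketch below. It assumes that embeddings are already available as plain lists of floats and that similarity is measured by cosine; the abstract does not state the similarity measure, and the names `cosine`, `select_examples`, and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_u * norm_v)


def select_examples(query_vec, pool, k=2):
    """Return ids of the k pool items most similar to the query embedding.

    pool: list of (example_id, embedding) pairs, e.g. annotated interventions.
    """
    ranked = sorted(pool, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [example_id for example_id, _ in ranked[:k]]


# Toy 2-dimensional embeddings standing in for real model outputs.
pool = [("ex_a", [1.0, 0.0]), ("ex_b", [0.0, 1.0]), ("ex_c", [1.0, 1.0])]
chosen = select_examples([1.0, 0.1], pool, k=2)
```

The selected examples would then be placed in the prompt as demonstrations. The three submitted variants differ in what text is embedded (intervention only, or intervention plus scheme descriptions and CQs), which in this sketch only changes how `query_vec` and the pool embeddings are produced.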