Vinay Ulli


2026

This paper describes Team CV’s systems for SemEval-2026 Task 4: Narrative Story Similarity and Narrative Representation Learning (Hatzel et al., 2026). For Track A (comparative judgment), we explore five prompting strategies (zero-shot, chain-of-thought, structured feature extraction, pairwise scoring, and few-shot) and QLoRA fine-tuning of smaller models. For Track B (narrative embeddings), we benchmark twelve dedicated text embedding models of varying dimensionality (384–4096) spanning open-source (E5-Large-v2, BGE, GTE, Qwen3 Embedding) and closed-source (OpenAI, Gemini, Mistral) families, and fine-tune Qwen3 Embedding 4B on task-specific triples. Few-shot prompting with Qwen-2.5 7B (64.00%) outperforms all fine-tuned variants (best 57.50%) on Track A; scaling to LLaMA-3.3-70B yields 75.00%. On Track B, OpenAI text-embedding-3-large (3072-d) achieves the best dev accuracy (67.00%), while fine-tuning Qwen3 Embedding 4B (2560-d) on synthetic triples slightly decreases accuracy. Our final submission, LLaMA-3.3-70B (3-shot) for Track A and text-embedding-3-large for Track B, achieves 70.75% and 64.50%, exceeding the GPT-4o-mini and STORY-EMB baselines, respectively.
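As a rough illustration of how accuracy over embedding triples might be computed for Track B, the sketch below assumes each item is an anchor story plus two candidates, and that the prediction is whichever candidate has the higher cosine similarity to the anchor. The tuple layout, the "a"/"b" labels, and the function names are hypothetical and are not taken from the task definition.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
# Hypothetical item layout: (anchor, candidate_a, candidate_b, gold label "a" or "b").
Triple = Tuple[Vector, Vector, Vector, str]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def triple_accuracy(triples: List[Triple]) -> float:
    """Fraction of triples where the closer candidate matches the gold label."""
    if not triples:
        raise ValueError("need at least one triple")
    correct = 0
    for anchor, cand_a, cand_b, gold in triples:
        # Ties are resolved in favour of candidate "a" (an arbitrary choice).
        pred = "a" if cosine(anchor, cand_a) >= cosine(anchor, cand_b) else "b"
        correct += int(pred == gold)
    return correct / len(triples)
```

In practice the vectors would come from one of the embedding models listed in the abstract; here they are plain lists of floats so the sketch stays self-contained.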
This paper describes our system for SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization. We focus on four low-resource Indian languages (Hindi, Bengali, Telugu, and Odia) across three subtasks: Polarization Detection, Type Classification, and Manifestation Identification. To address data scarcity, we employ cross-lingual data augmentation using IndicTrans2, expanding our dataset fourfold. Our unified architecture leverages Qwen3-4B-Instruct optimized via QLoRA, training a linear classification head on masked mean-pooled hidden states with only ∼33M trainable parameters. Our system achieved highly competitive results in Subtask 1, with an average Macro F1 of 0.813 across all languages (peaking at 0.8668 for Telugu). For the complex multi-label frameworks of Subtasks 2 and 3, our results expose a significant pre-training bias within foundational LLMs; while Hindi maintained strong F1 scores of 0.7008 and 0.7248, performance dropped considerably for the other three languages, highlighting the ongoing challenges of cross-lingual transfer for nuanced rhetorical techniques.
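The masked mean pooling and linear head mentioned above can be sketched in plain Python for a single sequence. This is only an illustration of the pooling idea, assuming a 0/1 attention mask that marks padding; the function names, toy dimensions, and weights are hypothetical, and the actual system would operate on batched tensors from the language model.

```python
from typing import List, Sequence


def masked_mean_pool(hidden_states: Sequence[Sequence[float]],
                     attention_mask: Sequence[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    if len(hidden_states) != len(attention_mask):
        raise ValueError("hidden_states and attention_mask must align")
    kept = [vec for vec, m in zip(hidden_states, attention_mask) if m]
    if not kept:
        raise ValueError("every token is masked out")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def linear_head(pooled: Sequence[float],
                weights: Sequence[Sequence[float]],
                bias: Sequence[float]) -> List[float]:
    """One logit per class: the dot product of a weight row with the pooled vector, plus bias."""
    return [sum(w * x for w, x in zip(row, pooled)) + b
            for row, b in zip(weights, bias)]
```

Dividing by the number of unmasked tokens, rather than the padded length, keeps the pooled vector from shrinking as padding grows.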
This paper describes the system submitted by team Aurum to the Medical Decision Extraction, Analysis, and Classification Task (MedExACT) at BioNLP 2026. The task requires the extraction and classification of contiguous text spans representing medical decisions from lengthy ICU discharge summaries. To address the dual challenges of long document lengths and severe class imbalance within a limited training set of 350 notes, we propose a two-pronged strategy. First, we employ a tripartite data augmentation pipeline utilizing rule-based entity replacement, LLM-based contextual paraphrasing, and synthetic note generation to expand the training data to over 2,300 notes. Second, we fine-tune a domain-specific Clinical Longformer model equipped with a sliding-window inference mechanism and Focal Loss to handle sequences up to 2,048 tokens while focusing on rare decision categories. Paired with a targeted post-processing module, our system achieved a Final Score of 0.5251, demonstrating high token-level detection (Token F1: 0.6311) and strong stability across patient demographics.
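A minimal sketch of the two mechanisms named above: overlapping window spans for long-document inference, and the per-token focal loss. The stride, the default gamma of 2.0, and the function names are assumptions chosen for illustration, not the settings reported by the authors; the class-weighting term of focal loss is omitted.

```python
import math
from typing import List, Tuple


def sliding_windows(n_tokens: int, window: int, stride: int) -> List[Tuple[int, int]]:
    """Return [start, end) spans that cover n_tokens with overlapping windows."""
    if window <= 0 or stride <= 0 or stride > window:
        raise ValueError("need 0 < stride <= window")
    if n_tokens <= 0:
        return []
    spans = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end >= n_tokens:
            break
        start += stride
    return spans


def focal_loss(p_true: float, gamma: float = 2.0) -> float:
    """Focal loss for one token: -(1 - p_t)^gamma * log(p_t).

    p_true is the predicted probability of the gold class. With gamma = 0
    this reduces to ordinary cross-entropy; larger gamma down-weights
    tokens the model already classifies confidently.
    """
    if not 0.0 < p_true <= 1.0:
        raise ValueError("p_true must be in (0, 1]")
    return -((1.0 - p_true) ** gamma) * math.log(p_true)
```

For a note longer than the model limit, `sliding_windows(len(tokens), 2048, 1024)` would yield half-overlapping spans; how predictions in the overlapping regions are merged is a separate design choice not covered here.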