Huizhi (Elly) Liang


2026

Abductive Event Reasoning (AER) requires selecting plausible causal explanations for observed events from incomplete and noisy textual evidence. Unlike deductive reasoning, abductive inference proceeds from effects to candidate causes and is highly sensitive to distractor information and implicit multi-hop relationships. We present a hybrid neural-symbolic framework that models abductive reasoning as structured causal validation rather than unconstrained generation. Our framework integrates hybrid retrieval, micro-level evidence grounding, concept-level causal abstraction, reinforcement learning-based decision calibration, and structured Theorem-of-Thought verification. Experiments on SemEval-2026 Task 12 show that LLM reasoning constrained by structured causal graphs achieves the strongest development performance of 0.5288 and a leaderboard score of 0.61 on the test set, substantially outperforming symbolic-only and policy-only variants. These findings indicate that explicit causal modelling improves robustness in document-grounded abduction tasks.
Word sense plausibility rating requires predicting the human-perceived plausibility of a given word sense on a 1–5 scale in the context of short narrative stories containing ambiguous homonyms. This paper systematically compares three approaches: (1) embedding-based methods pairing sentence embeddings with standard regressors, (2) transformer fine-tuning with parameter-efficient adaptation, and (3) large language model (LLM) prompting with structured reasoning and explicit decision rules. The best-performing system employs a structured prompting strategy that decomposes evaluation into narrative components (precontext, target sentence, ending) and applies explicit decision rules for rating calibration. The analysis reveals that structured prompting with decision rules outperforms both fine-tuned models and embedding-based approaches, and that prompt design matters more than model scale for this task.
Dimensional Aspect-Based Sentiment Analysis (DimABSA) extends traditional ABSA from categorical polarity labels to continuous valence–arousal (VA) regression. This paper describes a system developed for Track A, Subtask 1 (Dimensional Aspect Sentiment Regression), aiming to predict real-valued VA scores in the [1, 9] range for each given aspect in a text. A fine-tuning approach based on XLM-RoBERTa-base is adopted, using dual regression heads with sigmoid-scaled outputs for valence and arousal prediction. Separate models are trained for each language–domain pair (English and Chinese across restaurant, laptop, and finance domains), and training and development sets are merged for final test predictions. In development experiments, the fine-tuning approach is compared against several large language models under a few-shot prompting setting, demonstrating that task-specific fine-tuning outperforms these LLM-based methods across all evaluation datasets.
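The sigmoid-scaled regression output mentioned in the abstract above can be illustrated with a minimal sketch. It assumes each head emits one unbounded logit, which a sigmoid maps into the [1, 9] range. The function names and the example logits are hypothetical and not taken from the described system.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def scale_to_range(z: float, low: float = 1.0, high: float = 9.0) -> float:
    """Map an unbounded logit into [low, high] via a sigmoid."""
    return low + (high - low) * sigmoid(z)


def predict_va(valence_logit: float, arousal_logit: float) -> tuple[float, float]:
    """Assumed dual-head output: one scaled score per dimension."""
    return scale_to_range(valence_logit), scale_to_range(arousal_logit)


# Hypothetical logits for one aspect; a logit of 0 maps to the midpoint of the range.
valence, arousal = predict_va(0.0, 1.5)
```

Bounding the output this way keeps every prediction inside the valid annotation range, so no clipping is needed after inference.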
SemEval-2026 Task 4 on Narrative Similarity requires models to assess narrative alignment between stories rather than relying on surface lexical similarity. For Track A, we introduce Aspect-Based Narrative Similarity Agents (ABNS-Agents), a two-stage agent-based framework. It first extracts three core narrative aspects aligned with the task definition under a schema constraint, and then performs aspect-aligned similarity adjudication using an LLM decision model. For Track B, Narrative Supervised Contrastive Embeddings (NSConE) uses supervised contrastive learning to model narrative similarity. Our experiments show that ABNS-Agents achieves 70.25% accuracy on the test set, while NSConE reaches 68.5% test accuracy, demonstrating competitive performance across both reasoning-based and representation-learning paradigms. The findings highlight the effectiveness of aspect-aligned structured modelling and task-specific supervised contrastive learning for capturing narrative similarity beyond surface semantics.
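As a rough illustration of the supervised contrastive learning that the abstract above names for Track B, the sketch below computes a standard supervised contrastive loss over a small batch. It is not the NSConE implementation. The toy two-dimensional embeddings, the group labels, and the temperature of 0.1 are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def log_sum_exp(values):
    """Stable log(sum(exp(x))) over a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(x - m) for x in values))


def supcon_loss(embeddings, labels, temperature=0.1):
    """Average supervised contrastive loss over anchors that have a positive."""
    n = len(embeddings)
    total, n_anchors = 0.0, 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        positives = [j for j in others if labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        logits = {j: cosine(embeddings[i], embeddings[j]) / temperature
                  for j in others}
        log_denom = log_sum_exp(list(logits.values()))
        # Mean negative log-probability of picking each positive.
        total -= sum(logits[j] - log_denom for j in positives) / len(positives)
        n_anchors += 1
    return total / n_anchors if n_anchors else 0.0


# Hypothetical story embeddings: the first two and the last two point the same way.
stories = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned = supcon_loss(stories, [0, 0, 1, 1])   # labels agree with the geometry
shuffled = supcon_loss(stories, [0, 1, 0, 1])  # labels contradict the geometry
```

The loss is small when stories sharing a label already lie close together and large when they do not, which is the signal that pulls similar narratives together during training.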
We introduce a three-stage training framework for abductive event reasoning (AER). The task dataset is decomposed into three subsets: causal judgment, cause generation, and multiple-choice question answering (MCQA). Abductive reasoning requires understanding complex causal relationships between events, and small language models typically struggle with the multi-step inference involved. Our approach combines supervised fine-tuning with group relative policy optimization (GRPO) to strengthen the reasoning capabilities of a 0.5B-parameter model. On the SemEval-2026 Task 12 development set, our Casual-Qwen 0.5B model achieves $64.75\%$, outperforming the Qwen2.5:0.5b baseline ($63.78\%$) by a reported absolute margin of $0.0975\%$. Our ablation study reveals that binary causal judgment, rather than cause generation or direct MCQA training, is the key skill for the AER task, with the more complex stages significantly underperforming due to task misalignment or task complexity.
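The core of GRPO, which the abstract above names as its training method, is that each sampled response is scored relative to the other responses drawn for the same prompt. The sketch below shows only that group-relative advantage computation. The binary correctness rewards and the function name are assumptions for illustration, since the abstract does not describe the reward design.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group of responses to the same prompt."""
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled answers: 1.0 if judged correct, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that beat the group average receive a positive advantage and the rest a negative one, so no separate value model is needed to provide a baseline.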