\documentclass[11pt]{article}
\usepackage[final]{acl}
\usepackage{times}
\usepackage{latexsym}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{microtype}
\usepackage{inconsolata}
\usepackage{amsmath,amssymb,amsfonts}
\usepackage{graphicx}
\usepackage{xcolor}
\usepackage{booktabs}
\usepackage{multirow}

\begin{document}

% ==========================================
% TITLE & AUTHORS
% ==========================================
\title{Team HausaNLP at SemEval-2026 Task 4: Narratives via Semantic Embeddings}

\author{
\begin{tabular}{c}
Faisal Muhammad Adam$^{1}$ \qquad Sani Aji$^{2}$ \\
Lukman Jibril Aliyu$^{3}$ \\
$^{1}$ACETEL, National Open University of Nigeria \\
$^{2}$Department of Mathematics, Faculty of Science, Gombe State University, Gombe, Nigeria \\
$^{3}$HausaNLP \\
\texttt{faisaladamm@gmail.com} \qquad \texttt{ajysani@yahoo.com} \\
\texttt{lukman.j.aliyu@gmail.com}
\end{tabular}
}

\maketitle

% ==========================================
% ABSTRACT
% ==========================================
\begin{abstract}
This paper presents Team HausaNLP's submission to SemEval-2026 Task 4 (Track A), which requires identifying the more narratively similar of two candidate stories relative to an anchor. Narrative similarity is defined along three dimensions: abstract theme, course of action, and story outcomes. We conduct a systematic ablation comparing five approaches: a lexical TF-IDF baseline, two bi-encoder SBERT variants (\texttt{all-MiniLM-L6-v2} and \texttt{all-mpnet-base-v2}), a paraphrase-focused embedding model, and a cross-encoder re-ranker. On the 200-instance development set, \texttt{all-mpnet-base-v2} achieves the best performance (61.5\% accuracy, 61.48 macro-F1), outperforming both TF-IDF (54.5\%) and the official SBERT baseline (55.0\%).
Surprisingly, the cross-encoder re-ranker (55.5\%) does not improve on the MPNet bi-encoders, which we attribute to the long-document nature of Wikipedia story summaries exceeding the model's effective context window. On the official test set, our primary SBERT MiniLM submission achieved 61.50\% accuracy (33rd of 44 teams). Our error analysis over 200 development instances identifies five systematic failure categories, distinct from the All Correct / Partial cases, including 23 Lexical Trap cases, 23 Hard Cases, and 24 Proposed-Recovery cases, thereby informing concrete directions for future work.
\end{abstract}

% ==========================================
% 1. INTRODUCTION
% ==========================================
\section{Introduction}

Narrative understanding is a long-standing challenge in Natural Language Processing (NLP). Stories can share deep thematic and structural commonalities while exhibiting minimal surface-level lexical overlap, a property that exposes the limitations of traditional retrieval approaches and motivates the use of dense semantic representations.

SemEval-2026 Task 4~\cite{hatzel2026semeval} formalises this challenge as a comparative judgment task: given an \textit{anchor} story and two candidate stories, a system must determine which candidate is more narratively similar to the anchor. Narrative similarity is defined by three core components: (1) \textit{abstract theme}, the underlying ideas and motives; (2) \textit{course of action}, the sequence of central events and turning points; and (3) \textit{outcomes}, the resulting story resolutions. All story summaries are sourced from English Wikipedia, yielding over 1,000 annotated triples.

In this work, we investigate the degree to which lexical versus semantic representations capture narrative similarity.
We hypothesise that narrative alignment is fundamentally a semantic phenomenon: two stories may share almost no vocabulary yet describe structurally identical events, while a lexically similar distractor can mislead keyword-based systems. Our ablation across five systems on the 200-instance development split reveals a more nuanced picture than expected: while the larger bi-encoder \texttt{all-mpnet-base-v2} (61.5\%) clearly outperforms TF-IDF (54.5\%) and the official SBERT MiniLM baseline (55.0\%), a cross-encoder re-ranker fails to improve further (55.5\%), suggesting that long-document narrative summaries pose challenges for joint-encoding architectures. Our official Track A submission, using \texttt{all-MiniLM-L6-v2}, achieved 61.50\% on the test set, ranking 33rd among 44 competing teams~\cite{hatzel2026semeval}.

% ==========================================
% 2. RELATED WORK
% ==========================================
\section{Related Work}

\paragraph{Narrative similarity and representation.} Computational approaches to narrative similarity have a rich history rooted in story grammar formalisms~\cite{rumelhart1975notes} and script-based event representations~\cite{schank1977scripts}. More recently, \citet{hatzel2024story} introduced narrative-focused story embeddings derived from Wikipedia plot summaries, demonstrating that general-purpose sentence encoders fall short on deep narrative alignment tasks compared to domain-adapted representations. Their work directly motivates the SemEval-2026 Task 4 benchmark.

\paragraph{Sentence and document embeddings.} Sentence-BERT (SBERT)~\cite{reimers2019sentence} extended BERT~\cite{devlin2019bert} with a siamese network architecture, enabling efficient computation of semantically meaningful sentence embeddings via cosine similarity.
The \texttt{all-MiniLM-L6-v2} and \texttt{all-mpnet-base-v2} variants are trained on over one billion sentence pairs and serve as strong general-purpose baselines for semantic similarity tasks~\cite{wang2020minilm}.

\paragraph{Cross-encoders for ranking.} Cross-encoders jointly encode a query–candidate pair, enabling richer attention-based interactions between the two texts at the cost of higher computational overhead~\cite{humeau2020polyencoders}. In information retrieval pipelines, cross-encoders are typically used as re-rankers on top of bi-encoder shortlists~\cite{nogueira2019passage}. For narrative comparison, where subtle thematic coherence matters more than keyword overlap, cross-encoders represent a natural fit.

\paragraph{LLMs for narrative tasks.} Recent systems at SemEval-2026 Task 4 have demonstrated the effectiveness of large language models (LLMs) for narrative comparison, with LLM-based voting ensembles achieving up to 78\% test accuracy~\cite{hatzel2026semeval}. However, LLM-based approaches require significant compute; our work focuses on efficient embedding-based methods that remain accessible under resource constraints.

% ==========================================
% 3. METHODOLOGY
% ==========================================
\section{Methodology}

\subsection{Dataset}
The SemEval-2026 Task 4 dataset~\cite{hatzel2026semeval} consists of annotated triples of Wikipedia story summaries. Each instance contains an \textit{anchor text} and two candidate texts (\textit{text\_a} and \textit{text\_b}). The goal is to predict which candidate is more narratively similar to the anchor. We report development-set results for our ablation study (see Section~\ref{sec:results}) and official test-set results where available.

\subsection{Preprocessing}
All input texts undergo the following preprocessing: (1) stripping of leading/trailing whitespace; (2) collapsing of multiple whitespace characters into a single space.
For the TF-IDF baseline only, texts are additionally lowercased. Neural models receive the original mixed-case text, leaving any case normalisation to each checkpoint's own tokenizer; for case-sensitive tokenizers, casing can carry narrative-relevant information (e.g., proper nouns denoting characters or locations).

\subsection{Implementation Details and Reproducibility}
All systems are used in a zero-shot inference setting; we do not fine-tune any model on the SemEval-2026 Task 4 data. For every development or test instance, the decision rule is deterministic: we compute one similarity score between the anchor and each candidate, then select the candidate with the higher score. For TF-IDF, we use English stop-word removal and fit the vectorizer on each instance triple only. For the neural bi-encoders, we use the publicly available Sentence-Transformers checkpoints exactly as released, encode each text independently, and compare L2-normalised embeddings with cosine similarity. For the cross-encoder, we score the two anchor--candidate pairs independently with \texttt{cross-encoder/stsb-roberta-large} and choose the higher-scoring candidate. No task-specific hyperparameter tuning is performed beyond model selection on the development set.

This design keeps the comparison focused on representational differences between lexical, bi-encoder, paraphrase-oriented, and cross-encoder approaches. Because the story summaries are substantially longer than typical semantic textual similarity inputs, the cross-encoder is especially sensitive to input-length limits; this is one reason we analyse its behaviour separately in Section~\ref{sec:crossencoder}. Our code follows the same preprocessing and scoring pipeline for development and test data, so the experiments can be reproduced from the model names and decision rules reported here.
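
\paragraph{Decision rule in symbols.} As a compact restatement of the rule described above, let $A$ denote the anchor, $T_A$ and $T_B$ the two candidates, and $\mathbf{e}_X$ the vector representation of text $X$ (a TF-IDF vector or a bi-encoder embedding). The symbols $\mathbf{e}_X$, $s(\cdot,\cdot)$, and $\hat{y}$ are introduced here for illustration only. Under these assumptions, the score and prediction can be sketched as
\begin{align*}
s(A, T) &= \frac{\mathbf{e}_A^{\top}\mathbf{e}_T}{\lVert\mathbf{e}_A\rVert_2\,\lVert\mathbf{e}_T\rVert_2}, \\
\hat{y} &=
\begin{cases}
T_A & \text{if } s(A, T_A) > s(A, T_B), \\
T_B & \text{otherwise.}
\end{cases}
\end{align*}
When the embeddings are L2-normalised, $\lVert\mathbf{e}_A\rVert_2 = \lVert\mathbf{e}_T\rVert_2 = 1$ and the score reduces to the dot product $\mathbf{e}_A^{\top}\mathbf{e}_T$. For example, with hypothetical scores $s(A, T_A) = 0.42$ and $s(A, T_B) = 0.37$, the rule selects $T_A$. Assigning ties to $T_B$ mirrors the cross-encoder rule in Section~\ref{sec:crossencoder}; how ties are broken for the other systems is an assumption of this sketch.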
\subsection{System 1: TF-IDF Baseline (Lexical)} As a lexical baseline, we implement a Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer with English stop-word removal. For each instance, we fit the vectorizer on the triple $\{A, T_A, T_B\}$ and compute cosine similarities $\cos(A, T_A)$ and $\cos(A, T_B)$. The candidate with the higher similarity score is selected. \subsection{System 2: Bi-Encoder SBERT (Official Baseline)} The official task baseline is SBERT using the \texttt{all-MiniLM-L6-v2} model~\cite{reimers2019sentence}, which maps texts to 384-dimensional dense vectors. We replicate this approach exactly, encoding all texts into L2-normalised embeddings and selecting the candidate with the higher cosine similarity (equivalent to dot product under normalisation). \subsection{System 3: Stronger Bi-Encoder (\texttt{all-mpnet-base-v2})} To assess the impact of model capacity within the bi-encoder paradigm, we experiment with \texttt{all-mpnet-base-v2}, a larger model producing 768-dimensional embeddings, trained on the same large-scale semantic similarity corpora. The decision procedure is identical to System 2. This system constitutes our primary proposed contribution, as it achieves the best development-set performance in our ablation. Figure~\ref{fig:siamese_architecture} illustrates the siamese bi-encoder setup used by our SBERT-based systems: the anchor and candidate stories are encoded independently with shared weights, and narrative similarity is computed via cosine similarity in the embedding space. \begin{figure}[htbp] \centering \includegraphics[width=\linewidth]{siamese_biencoder.png} \caption{Siamese bi-encoder architecture used for our SBERT-based narrative similarity systems. 
The anchor and each candidate story are passed through the same encoder, and cosine similarity determines the preferred match.}
\label{fig:siamese_architecture}
\end{figure}

\subsection{System 4: Paraphrase Bi-Encoder (\texttt{paraphrase-mpnet-base-v2})}
We additionally evaluate \texttt{paraphrase-mpnet-base-v2}, a model fine-tuned specifically for paraphrase detection and paraphrase-aware similarity. Given that narrative similarity can involve semantically equivalent events expressed with very different surface forms, we hypothesise that paraphrase-aware representations may be particularly well-suited to the task.

\subsection{System 5: Cross-Encoder Re-Ranker}
\label{sec:crossencoder}
Cross-encoders jointly encode query–candidate pairs through full self-attention, allowing each token in the anchor to attend to every token in the candidate~\cite{humeau2020polyencoders}. We evaluate \texttt{cross-encoder/stsb-roberta-large} as a re-ranking approach. Formally:
\begin{itemize}
\item \textbf{Input:} Anchor story ($A$), Option 1 ($T_A$), Option 2 ($T_B$).
\item \textbf{Scoring:} $S_A = f_\theta([A; T_A])$,\; $S_B = f_\theta([A; T_B])$, where $f_\theta$ is the cross-encoder.
\item \textbf{Decision:} Predict Option 1 if $S_A > S_B$, else Option 2.
\end{itemize}
While cross-encoders typically outperform bi-encoders in information retrieval settings~\cite{nogueira2019passage}, the Wikipedia story summaries in this task are considerably longer than typical sentence-pair inputs. We include this system to examine whether the richer cross-attention mechanism compensates for this domain mismatch.

% ==========================================
% 4. RESULTS AND ANALYSIS
% ==========================================
\section{Results and Discussion}
\label{sec:results}

\subsection{Ablation: System Comparison}
Table~\ref{tab:ablation} reports accuracy, macro-F1, macro-precision, and macro-recall for all five systems on the 200-instance development set.
The results reveal a more nuanced picture than a simple lexical-to-semantic progression.

\begin{table}[ht]
\centering
\small
\caption{Development-set results ($n=200$). Acc = accuracy, F1 = macro-F1, Prec = macro-precision, Rec = macro-recall (all \%).}
\label{tab:ablation}
\setlength{\tabcolsep}{5pt}
\begin{tabular}{@{}lcccc@{}}
\toprule
\textbf{System} & \textbf{Acc.} & \textbf{F1} & \textbf{Prec.} & \textbf{Rec.} \\
\midrule
TF-IDF & 54.50 & 54.44 & 54.49 & 54.47 \\
SBERT MiniLM & 55.00 & 54.93 & 54.99 & 54.97 \\
Paraphrase MPNet & 59.00 & 58.85 & 59.05 & 58.95 \\
Cross-Enc. RoBERTa & 55.50 & 55.49 & 55.49 & 55.49 \\
\textbf{SBERT MPNet} & \textbf{61.50} & \textbf{61.48} & \textbf{61.50} & \textbf{61.48} \\
\bottomrule
\end{tabular}
{\footnotesize Test submission: SBERT MiniLM (61.50\%, rank 33/44).}
\end{table}

\texttt{all-mpnet-base-v2} achieves the highest development-set accuracy at 61.50\%, with macro-precision and macro-recall nearly identical (61.50\% and 61.48\%, respectively), indicating balanced predictions across both classes. The paraphrase-focused MPNet model (59.00\%) ranks second, suggesting that models trained to handle paraphrastic variation offer an advantage for narrative similarity.

The most striking finding is the underperformance of the cross-encoder (55.50\%), which barely surpasses the near-chance results of TF-IDF (54.50\%) and SBERT MiniLM (55.00\%). We attribute this to the long-document nature of Wikipedia story summaries: cross-encoders such as \texttt{cross-encoder/stsb-roberta-large} are fine-tuned on short sentence-pair benchmarks (STS-B), and their fixed maximum token length causes truncation of the long narrative inputs. Bi-encoders encode each text independently, so each summary receives its own token budget rather than sharing a single input window with the anchor; they remain subject to their own maximum sequence length, but the pooled vector for each summary can draw on more of its text.
This finding is consistent with prior work showing that cross-encoder advantages diminish or reverse when inputs substantially exceed training-time length distributions~\cite{nogueira2019passage}.

The near-random performance of TF-IDF (54.50\%) and SBERT MiniLM (55.00\%) on the development set, close to the 50\% chance baseline for a binary task, underscores how challenging this dataset is. The task organisers deliberately filtered for difficult cases with low inter-annotator agreement ($\alpha = 0.33$)~\cite{hatzel2026semeval}, meaning even strong models struggle.

\subsection{Official Test Results}
On the official Track A test set, our submitted SBERT MiniLM system achieved 61.50\% accuracy, placing 33rd among 44 participating teams~\cite{hatzel2026semeval}. The top system (COGNAC) achieved 78.00\% using LLM-based voting ensembles. Our best development-set system (\texttt{all-mpnet-base-v2}, 61.50\%) happens to match the test accuracy of our official submission, but the two figures come from different models on different splits and are not directly comparable: the same MiniLM system scores 55.00\% on the development set and 61.50\% on the test set, a gap of 6.5 points. A submission using \texttt{all-mpnet-base-v2} might have ranked higher, although development-set gains need not carry over to the test set.

\subsection{Distribution Analysis}
Figure~\ref{fig:distribution} shows the distribution of cosine similarity scores for the best-performing SBERT MPNet model. The overlap region between the selected (correct) and rejected (distractor) score distributions reflects the genuine difficulty of the task: many instances produce near-identical similarity scores for both candidates, corresponding to the hardest narrative comparison cases.

\begin{figure}[htbp]
\centering
\includegraphics[width=\linewidth]{similarity_distribution.png}
\caption{Distribution of cosine similarity scores for \texttt{all-mpnet-base-v2}. Green: selected (correct) candidate; Red: rejected (distractor).
The large overlap region reflects the task's inherent difficulty.} \label{fig:distribution} \end{figure} \subsection{Systematic Error Analysis} \label{sec:error} We categorise all 200 development instances into six mutually exclusive categories based on system agreement and correctness. Five of these are failure-oriented categories, while one corresponds to All Correct / Partial outcomes. Table~\ref{tab:error_cats} summarises the distribution; the proposed system refers to \texttt{all-mpnet-base-v2} (best dev system) and SBERT refers to \texttt{all-MiniLM-L6-v2} (official baseline). \begin{table}[h] \centering \small \caption{Error category distribution on the development set ($n=200$). ``Proposed'' = \texttt{all-mpnet-base-v2}; ``SBERT'' = \texttt{all-MiniLM-L6-v2}.} \label{tab:error_cats} \begin{tabular}{p{0.62\linewidth}cc} \toprule \textbf{Category} & \textbf{Count} & \textbf{\%} \\ \midrule All Correct / Partial & 85 & 42.5 \\ Proposed-Only Error & 26 & 13.0 \\ Proposed-Recovery (TF-IDF + SBERT fail) & 24 & 12.0 \\ Hard Case (all systems fail) & 23 & 11.5 \\ Lexical Trap (TF-IDF fails, neural correct)& 23 & 11.5 \\ Neural Failure (TF-IDF correct, neural wrong) & 19 & 9.5 \\ \bottomrule \end{tabular} \end{table} Four categories reveal distinct failure modes: \begin{enumerate} \item \textbf{Lexical Traps (11.5\%):} TF-IDF incorrect, neural systems correct. These 23 cases involve surface-level distractors---shared dates, character descriptions, or setting terms---that inflate lexical similarity scores for the wrong candidate. The correct narrative match shares thematic and structural properties that are invisible to bag-of-words representations. \item \textbf{Neural Failures (9.5\%):} TF-IDF correct, both neural systems incorrect. In these 19 cases, lexical overlap is a genuinely reliable signal, but dense embeddings introduce spurious semantic associations. 
These cases represent the irreducible advantage of lexical approaches on certain surface-transparent instances. \item \textbf{Proposed-Recovery (12.0\%):} Both TF-IDF and SBERT MiniLM fail, but \texttt{all-mpnet-base-v2} succeeds. These 24 cases show the value of a higher-capacity bi-encoder. In effect, MPNet resolves 12\% of the development set that both comparison systems miss. \item \textbf{Hard Cases (11.5\%):} All systems incorrect. These 23 instances represent genuinely ambiguous narrative comparisons, consistent with the dataset's low inter-annotator agreement~\cite{hatzel2026semeval}. No embedding-based system is likely to resolve these without deeper narrative reasoning. \end{enumerate} The \textbf{Proposed-Only Errors (13.0\%)} also merit attention. In 26 instances, \texttt{all-mpnet-base-v2} is wrong while TF-IDF and SBERT are correct. This is slightly more frequent than the Proposed-Recovery cases (24). The pattern suggests that higher capacity helps in some cases but also introduces new error modes, likely through overconfident semantic generalisation. Figure~\ref{fig:error_distribution} visualises the category counts from our error analysis. The distribution highlights that the largest share of the development set consists of cases where at least one system succeeds, but it also shows a substantial block of genuinely difficult or model-specific failures. \begin{figure}[htbp] \centering \includegraphics[width=\linewidth]{error_distribution.png} \caption{Distribution of error-analysis categories on the 200-instance development set. The chart highlights both recovery cases for the proposed model and persistent hard cases shared across systems.} \label{fig:error_distribution} \end{figure} Table~\ref{tab:qualitative} presents a representative Lexical Trap case drawn directly from the development set. 
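
\paragraph{Consistency check on category counts.} Since the six categories in Table~\ref{tab:error_cats} are described as mutually exclusive, their counts should partition the 200 development instances. A quick arithmetic check, using only the figures reported in that table, confirms this:
\begin{align*}
85 + 26 + 24 + 23 + 23 + 19 &= 200, \\
42.5 + 13.0 + 12.0 + 11.5 + 11.5 + 9.5 &= 100.0\%.
\end{align*}
The five failure-oriented categories therefore account for $200 - 85 = 115$ instances ($57.5\%$), and the difference between Proposed-Recovery and Proposed-Only Error cases is $24 - 26 = -2$ instances.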
\begin{table*}[htbp]
\centering
\small
\renewcommand{\arraystretch}{1.3}
\begin{tabular}{p{0.3\linewidth} p{0.3\linewidth} p{0.3\linewidth}}
\toprule
\textbf{Anchor Story} & \textbf{TF-IDF Prediction (Incorrect)} & \textbf{SBERT MPNet Prediction (Correct)} \\
\midrule
Dave Anderson and Manny Durrell are two high-class sneak thieves who have never been caught. [Gold: Option A]
& \textbf{Option B:} As the film opens Ahmad, a grade schooler, watches as his teacher is being harassed... \newline \textit{(Error: TF-IDF was misled by surface tokens unrelated to the heist narrative theme.)}
& \textbf{Option A:} The Great Depression is over. King of the con men Fargo Gondorf has been released from prison and is drawn back into one last great con... \newline \textit{(Success: MPNet captures the shared crime-partnership theme.)} \\
\bottomrule
\end{tabular}
\caption{A Lexical Trap case from the development set. TF-IDF is misled by surface overlap, whereas \texttt{all-mpnet-base-v2} correctly identifies the deeper thematic match: a partnership-based crime narrative.}
\label{tab:qualitative}
\end{table*}

% ==========================================
% 5. CONCLUSION
% ==========================================
\section{Conclusion}

Our participation in SemEval-2026 Task 4 confirms that narrative similarity is a challenging semantic task that resists simple keyword-based solutions. Through a five-system ablation on 200 development instances, we find that \texttt{all-mpnet-base-v2} (61.50\% accuracy, 61.48 macro-F1) is the strongest approach among those evaluated, outperforming TF-IDF (54.50\%), the official SBERT MiniLM baseline (55.00\%), and a paraphrase bi-encoder (59.00\%). Notably, a cross-encoder re-ranker (55.50\%) does not improve on the bi-encoders, a finding we attribute to the long-document nature of Wikipedia story summaries, which causes truncation in joint-encoding architectures fine-tuned on short sentence pairs.
Our official test submission ranked 33rd of 44 teams (61.50\%), with the top system achieving 78.00\% via LLM-based ensembles.

Our error analysis identifies five systematic categories across 200 instances: 23 Lexical Traps (11.5\%) where TF-IDF fails on semantically equivalent but lexically distinct narratives; 19 Neural Failures (9.5\%) where dense embeddings introduce spurious associations; 24 Proposed-Recovery cases (12.0\%) demonstrating the concrete gain of higher-capacity bi-encoders; 23 Hard Cases (11.5\%) representing the dataset's inherent ambiguity ceiling; and 26 Proposed-Only Errors (13.0\%) revealing that increased model capacity also introduces new failure modes.

Future work should explore: (1) narrative-specific bi-encoder fine-tuning on story-similarity data~\cite{hatzel2024story}, which may close the gap with LLM-based systems at lower computational cost; (2) long-document-aware cross-encoders (e.g.\ Longformer-based) that can handle full story summaries without truncation; and (3) structured prompting of LLMs with explicit narrative decomposition (theme, events, outcomes) as demonstrated by the top systems at this shared task.

% ==========================================
% REFERENCES
% ==========================================
\bibliography{references}

\end{document}