\documentclass[11pt]{article}

% Change "review" to "final" to generate the final (sometimes called camera-ready) version.
% Change to "preprint" to generate a non-anonymous version with page numbers.
\usepackage[final]{acl}

% Standard package includes
\usepackage{times}
\usepackage{latexsym}
\usepackage{booktabs}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{stfloats}
\usepackage{multicol}
\usepackage{multirow}
% For proper rendering and hyphenation of words containing Latin characters (including in bib files)
\usepackage[T1]{fontenc}
% For Vietnamese characters
% \usepackage[T5]{fontenc}
% See https://www.latex-project.org/help/documentation/encguide.pdf for other character sets

% This assumes your files are encoded as UTF8
\usepackage[utf8]{inputenc}

% This is not strictly necessary, and may be commented out,
% but it will improve the layout of the manuscript,
% and will typically save some space.
\usepackage{microtype}

% This is also not strictly necessary, and may be commented out.
% However, it will improve the aesthetics of text in
% the typewriter font.
\usepackage{inconsolata}

%Including images in your LaTeX document requires adding
%additional package(s)
\usepackage{graphicx}
\usepackage{listings}
% If the title and author information does not fit in the area allocated, uncomment the following
%
%\setlength\titlebox{<dim>}
%
% and set <dim> to something 5cm or larger.

% \title{MoodMetric at SemEval-2026 Task 4: Narrative Story Similarity and Narrative Representation Learning}

\title{MoodMetric at SemEval-2026 Task 4: Dense Transformer Networks for Narrative Story Similarity and Representation}

\author{ Bolisetty Samanvitha, Shreya Ashar, Nishchay Mittal, \and Pruthwik Mishra \\
  Sardar Vallabhbhai National Institute of Technology (SVNIT), Surat, India 
  \\
  \texttt{\{u24ai024, u24ai003, u24ai044, pruthwikmishra\}@aid.svnit.ac.in}
  } 

\begin{document}
\maketitle

% -----------------------------------------------------------------------
% ABSTRACT
% [CHANGE 1] Fixed abstract to correctly describe Track A (pairwise ranking)
% and remove any Track B confusion. Updated accuracy figures to clearly
% reference Track A validation accuracy.
% -----------------------------------------------------------------------
%% OLD ABSTRACT (commented out):
% \begin{abstract}
% Modeling semantic similarity between long-form narratives is substantially more challenging than sentence-level matching. The major bottlenecks arise due to continuity at the level of structural elements such as characters, entities, or events, causal dependencies, and implicit thematic coherence. 
% In this work, we investigate transformer-based dense retrieval methods for finding narrative similarity.
% We evaluate multiple pretrained encoder architectures—including DeBERTa-v3, BGE-Base, BGE-Large, and E5-Large—adapted using triplet and contrastive metric learning objectives. 
% Our study analyzes the effects of model scale, pooling strategy, layer freezing, training duration, and cross-validation ensembling on generalization performance. Across experiments, we observe that larger contrastively pretrained embedding models consistently outperform smaller variants, but performance saturates rapidly given approximately 2,000 training triplets. 
% Moderate fine-tuning (4--5 epochs) yields optimal validation accuracy, while extended training leads to clear overfitting despite near-zero training loss. 
% Instruction-tuned embeddings do not demonstrate significant advantages over contrastively aligned alternatives for this narrative task. 
% Finally, arithmetic ensemble averaging of diverse embedding models produces the most robust representations, achieving approximately 65\% validation accuracy.
% \end{abstract}

%% NEW ABSTRACT:
\begin{abstract}
Modeling semantic similarity between long-form narratives is substantially more challenging than sentence-level matching. The major bottlenecks arise due to continuity at the level of structural elements such as characters, entities, or events, causal dependencies, and implicit thematic coherence. 
In this work, we investigate transformer-based dense retrieval methods for the SemEval-2026 Task 4 narrative similarity challenge, focusing primarily on \textbf{Track A} (comparative narrative ranking) and \textbf{Track B} (narrative embedding generation).
We evaluate multiple pretrained encoder architectures—including DeBERTa-v3, BGE-Base, BGE-Large, and E5-Large—adapted using triplet and contrastive metric learning objectives. 
Our study analyzes the effects of model scale, pooling strategy, layer freezing, training duration, and cross-validation ensembling on generalization performance.
Across experiments on the \textbf{Track A pairwise ranking task}, we observe that larger contrastively pretrained embedding models consistently outperform smaller variants, but performance saturates rapidly given approximately 2,000 training triplets. 
Moderate fine-tuning (4--5 epochs) yields optimal Track A validation accuracy, while extended training leads to clear overfitting despite near-zero training loss. 
Instruction-tuned embeddings do not demonstrate significant advantages over contrastively aligned alternatives for this narrative task. 
Finally, arithmetic ensemble averaging of diverse embedding models produces the most robust Track A representations, achieving approximately \textbf{65.0\% Track A validation accuracy}.
\end{abstract}

\section{Introduction}

Modeling semantic similarity between long-form narratives remains a challenging problem in natural language processing. Unlike sentence-level similarity or paraphrase detection, narrative comparison requires capturing event progression, implicit causality, character intent, and thematic coherence. Subtle differences in plot structure or temporal ordering can significantly alter meaning despite high lexical overlap, making surface-level similarity metrics insufficient for robust story-level comparison \cite{chun2024aistorysimilarity}.

Recent advances in transformer-based encoders have improved semantic representation learning for retrieval and ranking tasks. Pretrained models such as BGE, E5, DeBERTa, and Sentence-BERT produce dense vector representations that capture contextual semantics beyond lexical similarity. When fine-tuned with contrastive or triplet-based objectives, these models can be adapted for narrative similarity. However, low-resource settings with limited training triplets introduce substantial overfitting risk and limit generalization.

In this work, we investigate transformer-based dense retrieval approaches for two tasks proposed by \citet{hatzel-etal-2026-semeval}: (1) \textbf{Track A}: comparative narrative ranking, a pairwise ranking task where the model determines which of two candidate stories is semantically closer to an anchor narrative; and (2) \textbf{Track B}: fixed-dimensional narrative embedding generation. We evaluate multiple pretrained encoders—including DeBERTa-v3, BGE-Base, BGE-Large, and E5-Large—under triplet and contrastive learning objectives, and analyze the effects of model scale, pooling strategy, layer freezing, training duration, and embedding-level ensembling.

\section{Related Work}

In early computational work on narrative modeling, \citet{chambers2008unsupervised} proposed unsupervised methods for learning narrative event chains that capture typical event orderings. More recent studies investigate narrative similarity directly. \citet{saldias-roy-2020-exploring} showed that structural narrative features improve similarity modeling for spoken personal narratives, while \citet{piper-etal-2021-narrative} highlighted the importance of incorporating narratological theory, including event sequencing and character arcs, into NLP models.

Recent advances in dense retrieval models based on pretrained transformers have significantly improved semantic representation learning. Contrastively pretrained embedding models such as BGE~\cite{xiao2024cpackpackedresourcesgeneral} and E5~\cite{wang2024textembeddingsweaklysupervisedcontrastive} achieve strong performance on semantic retrieval tasks, but their training data mainly consist of web documents and question-answer pairs rather than narrative texts. This motivates task-specific fine-tuning of dense encoders on narrative similarity data, which we explore in this work.

Several Semantic Evaluation (SemEval) shared tasks have focused on narrative understanding and similarity, where similarity is evaluated based on abstract themes, courses of action, and story outcomes, often utilizing multilingual news or story datasets \cite{chen-etal-2022-semeval,piskorski-etal-2025-semeval}. The current SemEval-2026 Task 4\footnote{\url{https://narrative-similarity-task.github.io/}} evaluates narrative similarity along three core aspects: (1) the abstract themes of the story, (2) the course of action, and (3) the story outcomes.

\section{Dataset}

The dataset provided by the organizers consists of narrative triplets in JSONL format.
Each sample contains an anchor narrative $a$, a positive narrative $p$, 
and a negative narrative $n$ indicating relative semantic similarity. The details are shown in Table~\ref{tab:ori_data}.
\begin{table}[ht]
    \centering
    \begin{tabular}{c|c|c}
         \textbf{Track}& \textbf{Type} & \textbf{\#Samples}\\\hline
         \multirow{3}{*}{A} & Train & 1900\\
         & Dev & 200\\
         & Test & 400\\\hline
         \multirow{3}{*}{B} & Train & 1900\\
         & Dev & 200\\
         & Test & 849\\\hline
    \end{tabular}
    \caption{Task-wise Original Dataset Details}
    \label{tab:ori_data}
\end{table}

All empty or null entries were replaced with empty strings. For \textbf{Track A}, the task is formulated as a triplet ranking problem.
Given $(a, p, n)$, the model learns embeddings such that:

\[
\cos(a, p) > \cos(a, n)
\]
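
As an illustrative formalization (the symbols $c_A$, $c_B$, $\hat{y}$, and $y$ are introduced here for exposition and are not part of the task definition), let $c_A$ and $c_B$ denote the two candidate stories for an anchor $a$. The ranking condition above then corresponds to the decision rule
\[
\hat{y} =
\begin{cases}
A & \text{if } \cos(a, c_A) > \cos(a, c_B),\\
B & \text{otherwise,}
\end{cases}
\]
and accuracy over $N$ evaluation triplets is the fraction of correct decisions, $\frac{1}{N}\sum_{i=1}^{N}\mathbf{1}[\hat{y}_i = y_i]$, where $y_i$ is the gold label of the $i$-th triplet.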

For \textbf{Track B}, models produce fixed-dimensional embeddings optimized 
for cosine-similarity-based retrieval.

\subsection{Synthetic Data Augmentation}
We use the Groq API\footnote{\url{https://console.groq.com}} to generate additional synthetic training triplets. The models used for this purpose are:
\begin{itemize}
\item Llama 3.1 8B Instant~\cite{grattafiori2024llama}
\item GPT-OSS 20B~\cite{agarwal2025gpt}
\item Qwen 3 32B~\cite{yang2025qwen3}
\item Groq Compound combining GPT-OSS 120B~\cite{agarwal2025gpt} and Llama 4~\cite{adcock2026llama}
\end{itemize}
We generated 836 synthetic samples in this process. The prompt template for synthetic data generation is provided in Table~\ref{tab:synth_data} in Appendix~\ref{sec:synth}.

\section{Experimental Setup}
We use the HuggingFace framework~\cite{wolf-etal-2020-transformers} to fine-tune different encoder-only models for both tracks. As a simple baseline, we develop a logistic-regression-based model that uses the concatenation of sparse TF-IDF~\cite{sparck1972statistical} and dense SBERT~\cite{thakur-2020-AugSBERT} representations as features.
We experiment with both CLS pooling and mean pooling over the outputs of an encoder-only BERT-style~\cite{devlin-etal-2019-bert} model to represent a narrative. All final embeddings are L2-normalized prior to similarity computation. For Track B, the representation of a story narrative is the mean of the pooled outputs from different encoder-only variants.
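
For concreteness, the two pooling strategies can be sketched as follows. The notation is introduced here for illustration only: $h_t$ is the final-layer hidden state of token $t$, $m_t \in \{0,1\}$ is the attention mask, and $T$ is the sequence length. CLS pooling takes $e_{\text{CLS}} = h_1$, the hidden state of the first token, whereas mean pooling averages over non-padding tokens. Either pooled vector $e$ is then L2-normalized:
\[
e_{\text{mean}} = \frac{\sum_{t=1}^{T} m_t\, h_t}{\sum_{t=1}^{T} m_t},
\qquad
\hat{e} = \frac{e}{\lVert e \rVert_2}.
\]
After L2 normalization, the cosine similarity between two narratives reduces to the dot product $\hat{e}_1^{\top}\hat{e}_2$.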

\subsection{Training Objective}

All models are trained using a triplet margin loss.

\begin{equation}
L = \max(0, \cos(a,n) - \cos(a,p) + m)
\end{equation}

where $m$ denotes the margin hyperparameter (typically 0.35).
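
As a worked illustration with hypothetical similarity values (not taken from our data), suppose $\cos(a,p) = 0.70$, $\cos(a,n) = 0.50$, and $m = 0.35$. Then
\[
L = \max(0,\; 0.50 - 0.70 + 0.35) = 0.15 ,
\]
so the triplet is still penalized even though it is ranked correctly, because the similarity gap of $0.20$ is smaller than the margin. If instead $\cos(a,p) = 0.90$, the gap of $0.40$ exceeds the margin and $L = \max(0,\, -0.05) = 0$.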

Some experiments additionally incorporate a
contrastive softmax loss with temperature $\tau$.

\begin{equation}
L_{\text{cont}} = 
-\log \frac{e^{\cos(a,p)/\tau}}
{e^{\cos(a,p)/\tau} + e^{\cos(a,n)/\tau}}
\end{equation}
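
Dividing the numerator and denominator by $e^{\cos(a,p)/\tau}$ gives an equivalent form that makes the role of the temperature $\tau$ explicit:
\[
L_{\text{cont}} = \log\!\left(1 + e^{(\cos(a,n) - \cos(a,p))/\tau}\right).
\]
As a worked illustration with hypothetical similarity values and $\tau = 0.05$: a correctly ranked triplet with a similarity gap of $0.20$ gives $L_{\text{cont}} = \log(1 + e^{-4}) \approx 0.018$; equal similarities give $\log 2 \approx 0.693$; and a triplet mis-ranked by $0.05$ gives $\log(1 + e) \approx 1.313$.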

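The total loss in hybrid experiments combines the triplet margin loss defined above, denoted $L_{\text{margin}}$, with the contrastive term weighted by a coefficient $\alpha$: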

\[
L_{\text{total}} = L_{\text{margin}} + \alpha L_{\text{cont}}
\]

\subsection{Models For Track A}

% [CHANGE 2] Added explicit clarification that all models below are evaluated
% on Track A (pairwise ranking task).
For \textbf{Track A} (pairwise narrative ranking), we finetune the following encoder-only models.

\begin{itemize}
    \item DeBERTa-v3-base (184M parameters, 768-dim)~\cite{he2021debertav3}
    \item BGE-Base-en-v1.5 (110M parameters, 768-dim)~\cite{xiao2024cpackpackedresourcesgeneral}
    \item BGE-Large-en-v1.5 (335M parameters, 1024-dim)~\cite{xiao2024cpackpackedresourcesgeneral}
    \item E5-Large-v2 (335M parameters, 1024-dim)~\cite{wang2024textembeddingsweaklysupervisedcontrastive}
\end{itemize}

We experiment with several model-specific adjustments such as layer freezing, 
cross-validation ensemble, and margin variations.
For Track B, final embeddings are computed via arithmetic averaging:

\begin{equation}
\mathbf{E}_{\text{ensemble}} =
\text{Normalize}\!\left(
\frac{1}{N}\sum_{i=1}^{N}\mathbf{E}_{i}
\right)
\end{equation}

This reduces model-specific bias and improves generalization.
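
As a toy illustration (two-dimensional vectors chosen purely for exposition, and assuming all $\mathbf{E}_i$ share the same dimensionality), take $N = 2$ with $\mathbf{E}_1 = (1, 0)$ and $\mathbf{E}_2 = (0, 1)$. The arithmetic mean is $(0.5, 0.5)$, whose norm is $\sqrt{0.5} \approx 0.707$, so
\[
\mathbf{E}_{\text{ensemble}} = \frac{(0.5,\, 0.5)}{\sqrt{0.5}} \approx (0.707,\, 0.707).
\]
The renormalization step matters because the mean of unit vectors has norm at most one, so cosine similarities computed on the unnormalized mean would not be directly comparable to those of the individual models.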

We evaluate five modeling paradigms for comparative narrative similarity:
(1) a lexical–semantic hybrid base classifier,
(2) a task-adapted transformer (DeBERTa),
(3) pretrained dense embedding models (BGE variants),
(4) text embeddings with contrastive pretraining and weak supervision (E5),
and (5) a multi-model embedding ensemble.

\subsubsection{Base Classifier}

We build a supervised hybrid similarity classifier that combines sparse TF-IDF features with dense SBERT embeddings. Texts are represented using unigram and bigram TF-IDF vectors (8,000 dimensions) and 384-dimensional SBERT embeddings. For each $(\text{anchor}, A, B)$ tuple, we concatenate the element-wise differences between the anchor and each candidate across both representations, forming a 16,768-dimensional feature vector. This vector is input to a logistic regression classifier~\cite{berkson1944application} to predict which candidate is semantically closer to the anchor.
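
The dimensionality follows directly from this construction. In a sketch using notation introduced here for illustration (the symbols $t(\cdot)$ and $s(\cdot)$ for the TF-IDF and SBERT vectors, $a$ for the anchor, and the ordering of the blocks are assumptions), the feature vector is
\[
\mathbf{x} = \bigl[\, t(a) - t(A);\; s(a) - s(A);\; t(a) - t(B);\; s(a) - s(B) \,\bigr],
\]
which has $2 \times (8{,}000 + 384) = 16{,}768$ entries.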

\subsubsection{DeBERTa-v3-Base} 

We adapt DeBERTa-v3-Base (12 layers, hidden size 768) to a Siamese ranking framework that generates an embedding for each anchor--candidate pair. To reduce positional bias, we swap candidates A and B and invert the label. The first four encoder layers are frozen to preserve the pretrained linguistic structure. We use attention-mask-weighted mean pooling. The encoder itself was pretrained with masked language modeling and replaced token detection objectives.

\subsubsection{BGE-Base} 

BGE-Base \cite{xiao2024cpackpackedresourcesgeneral} is a 12-layer transformer with a hidden size of 768 that produces 768-dimensional embeddings and is pretrained via contrastive learning. We use a cosine-similarity triplet loss for anchor–positive–negative separation.

\subsubsection{BGE-Large}

BGE-Large \cite{xiao2024cpackpackedresourcesgeneral} is a 24-layer transformer with a hidden size of 1024 that produces 1024-dimensional output embeddings and is pretrained using hard negatives. This model shows strong zero-shot performance. We fine-tune the model for 4 epochs; further training causes the model to overfit, with a decrease in validation accuracy.

\subsubsection{E5-Large-v2}

E5-Large-v2 \cite{wang2024textembeddingsweaklysupervisedcontrastive} is a 24-layer transformer producing 1024-dimensional embeddings. It is pretrained with instruction-style prompts under contrastive objectives for semantic retrieval with weak supervision signals from heterogeneous text pairs. For Track A, we fine-tune the encoder using a cosine-based triplet ranking loss.

\subsubsection{Ensemble Model}

To reduce model-specific bias, we aggregate embeddings:

\[
\mathbf{E}_{\text{ensemble}} =
\text{Normalize}
\left(
\frac{1}{N}\sum_{i=1}^{N}
\mathbf{E}_i
\right)
\]

Models included: BGE-Large, BGE-Base, DeBERTa, E5-Large, and additional BGE variants.

\subsection{Models for Track B}
Track B follows the dense embedding framework described in Track A, 
where each story is encoded using a transformer-based sentence encoder 
and optimized under a triplet ranking objective.

The primary distinction in Track B lies in generating high-quality 
story-level embeddings for retrieval rather than pairwise classification.
Each story embedding is L2-normalized to ensure stable cosine similarity comparisons. For the final submission, we ensemble embeddings from multiple independently 
trained models via arithmetic averaging followed by normalization, 
improving robustness and reducing model-specific bias.

% \section{Track B Conclusion}

For Track B, our final system employs a multi-faceted approach to narrative embedding and retrieval. We leverage an ensemble of \textbf{BGE-Large} (335M parameters) and \textbf{DeBERTa-v3-base} (184M parameters) models, each trained via 5-fold cross-validation to ensure robust generalization. The ensemble uses model averaging to compute L2-normalized 1024-dimensional embeddings, explicitly optimized for cosine-similarity-based retrieval and narrative matching tasks.

\subsection{Fine-Tuning Details}

Our training leverages a hybrid loss formulation combining \textbf{contrastive loss} (with temperature scaling $\tau = 0.05$ for numerical stability) and \textbf{margin ranking loss} (margin $= 0.4$), enabling the models to learn fine-grained similarities while maintaining ranking-aware separation between positive and negative narrative pairs. The BGE-Large variant was trained on an expanded dataset of 2,736 samples with tuned hyperparameters (batch size 12, learning rate $1.5 \times 10^{-5}$, 5 epochs), while DeBERTa-v3-base underwent supervised fine-tuning with layer freezing and early stopping (patience = 3) to prevent overfitting on the ranking task.

\subsection{Key Design Decisions}

\begin{itemize}
    \item \textbf{Ensemble Strategy:} Averaging predictions from 5-fold models reduces variance and leverages diverse feature representations learned across different data splits.
    \item \textbf{L2 Normalization:} Enables efficient cosine similarity computation and provides interpretable embedding geometry aligned with retrieval objectives.
    \item \textbf{Mixed-Precision Training:} Used on CUDA to accelerate convergence while maintaining gradient stability.
    \item \textbf{Data Augmentation:} Pseudo-labeling on test data during pre-training increased effective training set size and improved domain coverage.
\end{itemize}

\subsection{Performance}

The final system achieves \textbf{65.5\% accuracy} on the Track B test set, representing a substantial improvement over single-model baselines. Cross-validation analysis showed consistent performance (5-fold validation accuracies of 79.17\%--80.65\% for DeBERTa-v3), demonstrating strong generalization across narrative subdomains.

% \subsection{Limitations and Future Directions}

% While the ensemble approach provides solid performance, we identify several avenues for improvement:

% \begin{itemize}
%     \item \textbf{Harder Negative Mining:} Implementing curriculum learning or online hard negative selection could strengthen the learned embedding space by focusing on challenging narrative pairs.
%     \item \textbf{Cross-Encoder Re-ranking:} A learned cross-encoder could refine top-$k$ retrieval results using fine-grained pairwise comparisons.
%     \item \textbf{Domain-Adaptive Pre-training:} Continued pre-training on narrative-specific corpora before fine-tuning could improve task transfer and capture genre-specific narrative structures.
%     \item \textbf{Semantic Data Augmentation:} Back-translation, paraphrasing, or synthetic narrative generation could expand training diversity.
%     \item \textbf{Multi-Task Learning:} Joint training on Track A ranking and Track B retrieval might improve representation quality through shared semantic knowledge.
% \end{itemize}

% The strong validation performance suggests that the ensemble approach is well-suited for narrative understanding, and these extensions could yield further gains in both ranking and retrieval accuracy.

\subsection{Optimization Parameters}
The optimization parameters for the transformer models are presented in Table~\ref{tab:parms_trans}. The same settings are used for every model, with the learning rate and batch size chosen within the listed ranges.
\begin{table}[h]
\small
    \centering
    \begin{tabular}{c|c}\hline
        Optimizer & AdamW \\
        Learning rate & $1\times10^{-5}$ to $2\times10^{-5}$\\
        Batch size & 8--16\\
        Warmup ratio & 5\%\\
        Mixed Precision (AMP) & Enabled\\
        Hardware & NVIDIA T4/H100 GPUs\\\hline
    \end{tabular}
    \caption{Optimization Parameters For Transformer Models}
    \label{tab:parms_trans}
\end{table}

% -----------------------------------------------------------------------
% EXPERIMENTAL RESULTS
% [CHANGE 3] Rewrote Results section to:
%   (a) Explicitly label everything as Track A (pairwise ranking task)
%   (b) Distinguish validation vs. official test results
%   (c) Add new Table for official test scores
% -----------------------------------------------------------------------
\section{Experimental Results}

%% OLD Results section paragraph (commented out):
% The results are shown in Table~\ref{tab:results}. The baseline of hybrid TF-IDF + SBERT gives a steady performance of 57.5\% validation accuracy. Among the BERT models, the performance of the larger models are visibly superior to the base models. BGE-Large is the best performing model with a peak validation accuracy of 64.5\%. The ensemble provides very slight improvement over BGE-Large alone, but do not show any substantial gain in performance.

%% NEW Results section:
All results reported in this section correspond to \textbf{Track A}, the pairwise narrative ranking task, where the model predicts which of two candidate stories is semantically closer to a given anchor narrative.

Table~\ref{tab:results_combined} summarizes the \textbf{Track A validation accuracy} for all evaluated systems on the 200-sample development set. The hybrid TF-IDF + SBERT baseline gives a steady validation accuracy of 57.5\%. Among the transformer-based models, larger models are visibly superior: BGE-Large is the best single model with a peak \textbf{Track A validation accuracy of 64.5\%}. The ensemble of all models provides a marginal further improvement to \textbf{65.0\% Track A validation accuracy}, though without a substantial gain over BGE-Large alone.
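
To put the percentages quoted above in terms of sample counts (assuming each is computed over all 200 development triplets):
\[
0.575 \times 200 = 115, \qquad
0.645 \times 200 = 129, \qquad
0.650 \times 200 = 130 .
\]
Under this assumption, the difference between BGE-Large and the ensemble corresponds to a single development triplet.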

Table~\ref{tab:results_combined} reports the \textbf{official Track A test accuracy} obtained from CodaBench for each submitted configuration.

\subsection{Ablation Study: Effect of Synthetic Data}

To evaluate the impact of synthetic data augmentation, we conduct an ablation study comparing model performance with and without the additional generated triplets on the \textbf{Track A validation set}.

\begin{table}[h]
\centering
\begin{tabular}{l c}
\hline
\textbf{Setting} & \textbf{Track A Val. Accuracy} \\
\hline
Without synthetic & 63.2\% \\
With synthetic & 65.0\% \\
\hline
\end{tabular}
\caption{Impact of synthetic data augmentation on Track A validation performance.}
\end{table}

\noindent \textbf{Observation:} The inclusion of synthetic data improves Track A model performance by increasing training diversity. This helps the model generalize better to unseen narrative structures, especially in low-resource settings.

\subsection{Synthetic Data Validation}

We generated 836 synthetic triplets to augment the training data. To ensure data quality, we performed validation using both manual and automated checks.

\noindent \textbf{Validation process:}
\begin{itemize}
    \item Verified semantic consistency between anchor and similar stories
    \item Ensured dissimilar stories differed in theme, events, or outcomes
    \item Checked narrative coherence and readability
    \item Removed duplicates and malformed samples
\end{itemize}

\noindent \textbf{Outcome:} The majority of synthetic samples were coherent and aligned with the task definition, making them suitable for training and contributing to improved Track A performance.

% -----------------------------------------------------------------------
% [CHANGE 4] Updated Table 3 (tab:results) — now clearly labelled as
% Track A validation results, with renamed column headers.
% -----------------------------------------------------------------------

%% OLD Table (commented out):
% \begin{table*}[ht]
% \centering
% \begin{tabular}{lcccc}
% \toprule
% \textbf{Model} & \textbf{Params} & \textbf{Emb. Dim} & \textbf{Track A Best Acc.} & \textbf{Notes} \\
% \midrule
% SBERT + TF-IDF  & 22M   & 384 + sparse & 57.5\% & Baseline \\
% DeBERTa-v3-base & 183M  & 768          & 57.5\% & Siamese fine-tuned \\
% BGE-Base        & 110M  & 768          & 59.5\% & Efficient \\
% BGE-Large       & 335M  & 1024         & \textbf{64.5\%} & Best single model \\
% E5-Large        & 335M  & 1024         & 64.0\% & Zero-shot+Finetuning \\
% \midrule
% \textbf{Ensemble} & ---  & 1024         & \textbf{65.0\%} & Best overall \\
% \bottomrule
% \end{tabular}
% \caption{Comparative summary of model architectures and best Track A validation performance.}
% \label{tab:results}
% \end{table*}

%% NEW Table — Track A validation accuracy, clearly labelled.
%% NOTE: Accuracy values are kept exactly as in the original document.
%% Update them yourself once official numbers are confirmed.
\begin{table*}[ht]
\centering
\setlength{\tabcolsep}{4pt}
\begin{tabular*}{\textwidth}{@{\extracolsep{\fill}}lccccl}
\toprule
\textbf{Model} & \textbf{Params} & \textbf{Emb. Dim} & \textbf{Test Acc.} & \textbf{Val Acc.} & \textbf{Notes} \\
\midrule
SBERT + TF-IDF  & 22M   & 384 + sparse & 59.0\% & 57.5\% & Baseline \\
DeBERTa-v3-base & 184M  & 768          & 52.0\% & 79.16\% & Siamese fine-tuned \\
BGE-Base        & 110M  & 768          & 59.5\% & 58.5\% & Efficient \\
BGE-Large       & 335M  & 1024         & \textbf{65\%} & 64.5\% & Best single model \\
E5-Large        & 335M  & 1024         & 62.0\% & 60.0\% & Zero-shot + FT \\
\midrule
\textbf{BGE-Large (Min. Ranking)} & 335M & 1024 & 62.0\% & 63.5\% & Ranking baseline \\
\textbf{BGE-Large (FT, 4 ep.)} & 335M & 1024 & 65\% & 64.5\% & Improved training \\
\midrule
\textbf{Ensemble} & ---  & 1024         & \textbf{XX.X\%} & \textbf{61.0\%} & --- \\
\bottomrule
\end{tabular*}
\caption{Comprehensive comparison of all models on Track A, including validation and official test accuracy. \\ \small\textit{The Ensemble test accuracy is marked as XX.X\% as this configuration was not submitted for official evaluation. Min.\ = Minimal, FT = Fine-tuned, Acc.\ = Accuracy. }}
\label{tab:results_combined}
\end{table*}

Across experiments, several trends emerge. The hybrid TF-IDF + SBERT baseline underperforms dense transformer models due to limited narrative modeling capacity and sensitivity to the small dataset. DeBERTa is affected by sequence length limits and shows strong cross-validation performance but poor held-out \textbf{Track A} accuracy, indicating overfitting.

BGE-Large consistently outperforms BGE-Base on \textbf{Track A}, highlighting the importance of model capacity and embedding dimensionality. Performance peaks around 4 epochs, while extended training leads to overfitting, suggesting data quantity as the primary bottleneck. E5-Large, despite instruction-tuned pretraining, shows no significant advantage over BGE-Large on the Track A ranking task, and ensemble averaging provides no complementary gains.

Overall, contrastively pretrained models with task-aligned objectives produce stronger Track A representations than hybrid or instruction-tuned approaches in low-resource narrative similarity settings.

\subsection{Qualitative Error Analysis}

We analyze a few failure cases to understand model limitations on the Track A pairwise ranking task.

\noindent \textbf{Surface similarity confusion:}  
The model prefers candidates with high lexical overlap (e.g., ``athlete'', ``training'') even when narrative outcomes differ (failure vs.\ success).

\noindent \textbf{Outcome mismatch:}  
In several cases, the model selects stories with similar setups but different endings, indicating weak sensitivity to outcome alignment.

\noindent \textbf{Summary:}  
These errors suggest the model relies more on surface-level similarity than deeper narrative structure such as outcomes and implicit themes.

\section{Key Observations}

Several consistent findings emerged across our Track A experiments:

\paragraph{Moderate fine-tuning is optimal.} Performance peaks at 4--5 training epochs for large models. Extended training causes overfitting given the limited dataset size ($\approx$2.5k samples).

\paragraph{Model scale matters, but saturates quickly.} Larger models (BGE-Large, E5-Large) consistently outperform smaller ones on Track A, but gains diminish rapidly beyond a certain scale under low-data conditions.

\paragraph{Instruction-style pretraining shows marginal gains.} E5-Large's instruction-style contrastive pretraining improved training stability but did not yield significant Track A accuracy improvements compared to BGE-Large.

\paragraph{Ensemble averaging improves robustness.} Combining diverse model architectures via arithmetic mean consistently reduces variance and improves Track A generalization over any single model.

\paragraph{Data quality dominates modest data scaling.} The quality of triplet training examples has a greater impact on final Track A performance than small increases in dataset size.

% -----------------------------------------------------------------------
% CONCLUSION
% [CHANGE 6] Fixed conclusion to correctly reference Track A and Track B
% separately, removing any confusion between the two tracks.
% -----------------------------------------------------------------------

%% OLD Conclusion (commented out):
% \section{Conclusion}
% We have described our system for Track B narrative story embedding generation.
% Through systematic exploration of transformer-based encoders under triplet supervision, we find that BGE-Large-en-v1.5 provides the strongest standalone performance, while ensemble averaging of diverse architectures yields the most robust final embeddings.
% Our final system produces L2-normalized 1024-dimensional vectors that achieve approximately 65.0\% accuracy on the validation set.
% Future work could explore harder negative mining strategies, cross-encoder re-ranking, or domain-adaptive pre-training to further improve narrative embedding quality.
\section{Future Directions}
While the ensemble approach provides solid performance, we identify several avenues for improvement:

\begin{itemize}
    \item \textbf{Harder Negative Mining:} Implementing curriculum learning or online hard negative selection could strengthen the learned embedding space by focusing on challenging narrative pairs.
    \item \textbf{Cross-Encoder Re-ranking:} A learned cross-encoder could refine top-$k$ retrieval results using fine-grained pairwise comparisons.
    \item \textbf{Domain-Adaptive Pre-training:} Continued pre-training on narrative-specific corpora before fine-tuning could improve task transfer and capture genre-specific narrative structures.
    \item \textbf{Semantic Data Augmentation:} Back-translation, paraphrasing, or synthetic narrative generation could expand training diversity.
    \item \textbf{Multi-Task Learning:} Joint training on Track A ranking and Track B retrieval might improve representation quality through shared semantic knowledge.
\end{itemize}


% \section{Reproducibility and Resources}

% To ensure reproducibility, we release all trained model checkpoints publicly on Hugging Face:


% \begin{itemize}

%     \item \textbf{Project codebase:}  
%     \url{https://github.com/samanvitha7/SemEval2026-task4}  
%     Complete training pipeline, data processing scripts, and evaluation code \cite{semeval2026repo}
    
%     \item \textbf{DeBERTa-based ranking model:}  
%     \url{https://huggingface.co/samanvitha7/semeval-hcp-deberta}  
%     Fine-tuned DeBERTa-v3 model used for comparative narrative ranking.

%     \item \textbf{E5-Large checkpoints:}  
%     \url{https://huggingface.co/samanvitha7/semeval2026-e5_large-checkpoints}

%     \item \textbf{BGE-Large (base variant):}  
%     \url{https://huggingface.co/samanvitha7/semeval2026-bge_large-bge-large-all-bge_base-checkpoints}

%     \item \textbf{BGE-Large (base variant – final model):}  
%     \url{https://huggingface.co/samanvitha7/semeval2026-bge_large-bge-large-all-bge_base-checkpoints-final_model}

%     \item \textbf{BGE-Large (expanded):}  
%     \url{https://huggingface.co/samanvitha7/semeval2026-bge_large-bge-large-all-bge_expanded-checkpoints}

%     \item \textbf{BGE-Large (expanded – final model):}  
%     \url{https://huggingface.co/samanvitha7/semeval2026-bge_large-bge-large-all-bge_expanded-checkpoints-final_model}

%     \item \textbf{BGE-Large (improved variant):}  
%     \url{https://huggingface.co/samanvitha7/semeval2026-bge_large-bge-large-all-bge_improved-checkpoints}

%     \item \textbf{BGE-Large (frozen variant):}  
%     \url{https://huggingface.co/samanvitha7/semeval2026-bge_large-bge-large-all-bge_frozen-checkpoints}

% \end{itemize}

\section{Reproducibility and Resources}

Code and models are publicly available.
Code: \url{https://github.com/samanvitha7/SemEval2026-task4} \cite{semeval2026repo}. 
Models: DeBERTa (\url{https://huggingface.co/samanvitha7/semeval-hcp-deberta}), 
E5-Large (\url{https://huggingface.co/samanvitha7/semeval2026-e5_large-checkpoints}), 
and BGE-Large variants (\url{https://huggingface.co/samanvitha7}).


These checkpoints correspond to different training strategies explored in our experiments, including data augmentation, layer freezing, and extended fine-tuning.

%% NEW Conclusion:
\section{Conclusion}

We have described our system for SemEval-2026 Task 4, covering both \textbf{Track A} (comparative narrative ranking) and \textbf{Track B} (narrative story embedding generation).
For \textbf{Track A}, we formulate the task as a pairwise ranking problem where the model determines which of two candidate stories is semantically closer to an anchor narrative.
Through systematic exploration of transformer-based encoders under triplet supervision, we find that BGE-Large-en-v1.5 provides the strongest standalone performance on \textbf{Track A}, with a \textbf{validation accuracy of 64.5\%} and a \textbf{test accuracy of 65\% on submission (25\textsuperscript{th} position)}.
For \textbf{Track B}, our final system produces L2-normalized 1024-dimensional embeddings via ensemble averaging, optimized for cosine-similarity-based retrieval, with a validation accuracy of 65\% and a test accuracy of 65.5\% (16\textsuperscript{th} position).
Future work could explore harder negative mining strategies, cross-encoder re-ranking, or domain-adaptive pre-training to further improve both narrative ranking and embedding quality.
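
To make the two formulations concrete, the sketch below averages per-model embeddings into a single L2-normalized vector and then decides which of two candidates is closer to an anchor. It is a hedged illustration rather than our exact pipeline: the order of normalization and averaging, the function names, and the three-dimensional toy vectors (standing in for 1024-dimensional embeddings) are assumptions.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def ensemble_embedding(model_vectors):
    """Average the L2-normalized vectors from several models, then re-normalize.

    Assumes every model emits vectors of the same dimensionality.
    """
    normalized = [l2_normalize(v) for v in model_vectors]
    dim = len(normalized[0])
    mean = [sum(v[i] for v in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean)


def closer_candidate(anchor, cand_a, cand_b):
    """Return "A" or "B"; for unit vectors the dot product is the cosine similarity."""
    sim_a = sum(x * y for x, y in zip(anchor, cand_a))
    sim_b = sum(x * y for x, y in zip(anchor, cand_b))
    return "A" if sim_a >= sim_b else "B"


# Each story is embedded by two hypothetical models (toy 3-dimensional vectors).
anchor = ensemble_embedding([[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]])
story_a = ensemble_embedding([[0.9, 0.1, 0.0], [1.0, 0.1, 0.0]])
story_b = ensemble_embedding([[0.0, 1.0, 0.0], [0.0, 0.9, 0.1]])

choice = closer_candidate(anchor, story_a, story_b)
anchor_norm = math.sqrt(sum(x * x for x in anchor))
```

Because the ensemble output is re-normalized, downstream retrieval can use a plain dot product in place of cosine similarity.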


\section*{Limitations}
We did not experiment with any decoder-only models for this task. We implemented only four BERT variants with different pretraining objectives, which limits the coverage of our experiments. We augmented our training data with only $\approx$800 synthetic triples; increasing the number of synthetic samples might have improved the performance of the submitted models.

\section*{Acknowledgments}

We thank SVNIT Surat for access to the Aurora High-Performance Computing (\href{https://www.svnit.ac.in/web/department/computer/5hpc-lab.html}{HPC}) cluster and the GPU resources used to carry out our experiments.

\bibliography{custom}
% \newpage
\appendix
\section{Appendix}
\subsection{Prompt Template for Synthetic Data Creation}
\label{sec:synth}
\begin{table*}[ht]
\begin{tabular}{p{0.92\textwidth}}\hline
\textbf{Prompt Template}\\\hline
You are an expert who can generate a similar story and dissimilar story given an input anchor story.\\
- For a given anchor story, generate only a single similar story and only a single dissimilar story.\\
- Do not generate any additional anchor texts other than the existing ones in the input CSV file.\\
- For multiple anchor stories generate similar and dissimilar stories for each anchor story.\\
- Save each line in JSON format corresponding to an anchor story, the similar story, and the dissimilar story.\\
- Do not include any explanations or extra text.\\
- Do not generate code or other kinds of textual noise.\\
- For thinking or reasoning, do it internally without including it in the output.\\
- Ensure that the similar story and dissimilar story are coherent and contextually relevant.\\
- The similar story should have similar themes, settings, and character types as the anchor story.\\
- The dissimilar story should have different themes, settings, and character types compared to the anchor story.\\
- The length of each similar story and each dissimilar story should be approximately the same as the corresponding anchor story.\\
- The output format should be as follows:
For n anchor stories:\\
\begin{lstlisting}
{
 {
  "anchor_story": "<anchor_story_text_1>",
  "similar_story": "<similar_story_text_1>",
  "dissimilar_story": "<dissimilar_story_text_1>"
 },
 {
  "anchor_story": "<anchor_story_text_2>",
  "similar_story": "<similar_story_text_2>",
  "dissimilar_story": "<dissimilar_story_text_2>"
 },
  ...
}
\end{lstlisting}\\
Document:
\{Source Document\}\\\hline
\end{tabular}
\caption{Prompt Template for Synthetic Triple Generation}
\label{tab:synth_data}
\end{table*}
\end{document}