\documentclass[11pt]{article}

% Change "review" to "final" to generate the final (sometimes called camera-ready) version.
% Change to "preprint" to generate a non-anonymous version with page numbers.
\usepackage[final]{acl}

% Standard package includes
\usepackage{times}
\usepackage{latexsym}

% For proper rendering and hyphenation of words containing Latin characters (including in bib files)
\usepackage[T1]{fontenc}
% For Vietnamese characters
% \usepackage[T5]{fontenc}
% See https://www.latex-project.org/help/documentation/encguide.pdf for other character sets

% This assumes your files are encoded as UTF8
\usepackage[utf8]{inputenc}

% This is not strictly necessary, and may be commented out,
% but it will improve the layout of the manuscript,
% and will typically save some space.
\usepackage{microtype}

% This is also not strictly necessary, and may be commented out.
% However, it will improve the aesthetics of text in
% the typewriter font.
\usepackage{inconsolata}

%Including images in your LaTeX document requires adding
%additional package(s)
\usepackage{graphicx}

% If the title and author information does not fit in the area allocated, uncomment the following
%
%\setlength\titlebox{<dim>}
%
% and set <dim> to something 5cm or larger.
% 
\usepackage{booktabs}
\usepackage{multirow}
\usepackage{amsmath}

\usepackage{enumitem}
\setlist{nosep,leftmargin=*}

\title{clulab-retrieval at SemEval-2026 Task 8: A Comparative Analysis of Dense Retrievers and HyDE for Multi-Turn Conversational Retrieval}

% Author information can be set in various styles:
% For several authors from the same institution:
% \author{Author 1 \and ... \and Author n \\
%         Address line \\ ... \\ Address line}
% if the names do not fit well on one line use
%         Author 1 \\ {\bf Author 2} \\ ... \\ {\bf Author n} \\
% For authors from different institutions:
% \author{Author 1 \\ Address line \\  ... \\ Address line
%         \And  ... \And
%         Author n \\ Address line \\ ... \\ Address line}
% To start a separate ``row'' of authors use \AND, as in
% \author{Author 1 \\ Address line \\  ... \\ Address line
%         \AND
%         Author 2 \\ Address line \\ ... \\ Address line \And
%         Author 3 \\ Address line \\ ... \\ Address line}

\author{Hyungji Kim \quad Siva Rohit Kondapaneni \quad Steven Bethard \\
  University of Arizona \\
  \texttt{\{hyungjikim, sivarohit2002, bethard\}@arizona.edu}}


\begin{document}
\maketitle



\begin{abstract}
We present a comparative analysis of dense retrievers and retrieval strategies for multi-turn conversational retrieval in SemEval-2026 Task 8 (MTRAGEval). Our official submission employed a fine-tuned E5-based dense retriever (E5-FT, $\sim$110M parameters) with Hypothetical Document Embeddings (HyDE), achieving an nDCG@5 of .3309 and ranking 31st out of 38 systems. On the development set, we also compared E5-FT with BGE embeddings, dense-only with hybrid retrieval strategies, and HyDE with keyword extraction. We found that (1) BGE (general-purpose, $\sim$110M) outperforms our domain-fine-tuned E5-FT ($\sim$110M) by 30.5\% on baseline retrieval, suggesting that model selection may matter more than domain-specific fine-tuning; (2) the BM25 and dense components of hybrid retrieval provide complementary signals, with HyDE improving BM25 by 26.7\% and dense retrieval by 4.0\%; and (3) keyword-based query simplification degrades performance by 11--28\% across domains, supporting HyDE's approach of preserving semantic richness through passage-level text.
\end{abstract}

\section{Introduction}

Conversational retrieval systems face unique challenges compared to single-turn retrieval: queries often contain elliptical references, topic shifts, and contextual dependencies across multiple turns. The MTRAGEval benchmark~\cite{katsis-etal-2025-mt,rosenthal2026mtragunbenchmarkopenchallenges,Rosenthal2026MTRAGEval} addresses these challenges with 110 human-created conversations spanning 842 retrieval tasks across four domains.

Our approach leverages Hypothetical Document Embeddings \cite[HyDE; ][]{gao-etal-2023-precise}, which generates hypothetical answer passages that are embedded and used for retrieval instead of the original query. HyDE is particularly valuable in conversational settings where queries may be terse, ambiguous, or context-dependent.

We present a comparative analysis of dense retrievers and retrieval strategies for this task. Our official submission employed a fine-tuned E5-based dense retriever (E5-FT, $\sim$110M parameters) with HyDE, achieving nDCG@5 of .3309. We also explored alternative configurations on our development set, including BGE embeddings, hybrid retrieval (BM25+dense), and various HyDE application strategies.
Our main contributions are:
\begin{itemize}
\item Systematic comparison of dense retrievers for conversational search: E5-FT ($\sim$110M, fine-tuned, domain-specific) versus BGE ($\sim$110M, general-purpose)
\item Evaluation of retrieval strategies: dense-only versus hybrid (BM25+dense)
\item Extensive HyDE ablation studies across models, query formulations, and retrieval methods
\item Keyword extraction ablation showing that query simplification degrades retrieval performance
\item Analysis of factors affecting performance in multi-turn conversational retrieval: training approach (general-purpose vs.\ domain-specific fine-tuning), query formulation sensitivity, and complementary retrieval signals
\end{itemize}

Our system achieved an nDCG@5 of .3309 on the official test set, ranking 31st out of 38 systems, below both the top baseline (ELSER + Rewrite: .4795) and the top-performing system (.5776). Development-set analysis revealed that (1) BGE outperforms our domain-fine-tuned E5-FT by 30.5\%, suggesting that model selection may matter more than domain-specific fine-tuning, (2) BM25 and dense retrieval respond differently to HyDE and appear to provide complementary signals, and (3) keyword-based simplification degrades performance by 11--28\% across domains. Our code is available at \url{https://github.com/clulab/semeval2026-task8}.

\section{Background}

\subsection{Task Overview}

MTRAGEval~\cite{rosenthal2026mtragunbenchmarkopenchallenges,Rosenthal2026MTRAGEval} evaluates retrieval-augmented generation (RAG) systems on multi-turn conversational scenarios across four domains: ClapNQ (Wikipedia), Cloud (IBM technical documentation), FiQA (financial), and Govt (government). All conversations and documents are in English. The benchmark contains 110 conversations with an average of 7.7 turns each, totaling 842 evaluation tasks.

The task provides three query formulations: \textbf{last-turn} (only the most recent user question), \textbf{questions} (concatenation of all user questions), and \textbf{rewrite} (standalone rewritten query incorporating necessary context). Each domain uses 512-token passages with 100-token overlap. Retrieval is evaluated using Recall@$K$ and nDCG@$K$ for $K \in \{1, 3, 5, 10\}$.
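
For reference, the two metrics can be written as follows. This is a sketch of the standard definitions in our own notation, assuming a relevance label $\mathrm{rel}_i$ for the passage at rank $i$, a set $R$ of relevant passages for the query, and linear gain; the official scorer may use a different gain function.
\begin{align*}
\mathrm{DCG@}K &= \sum_{i=1}^{K} \frac{\mathrm{rel}_i}{\log_2(i+1)}, \\
\mathrm{nDCG@}K &= \frac{\mathrm{DCG@}K}{\mathrm{IDCG@}K}, \\
\mathrm{Recall@}K &= \frac{|R \cap \mathrm{top}_K|}{|R|},
\end{align*}
where $\mathrm{IDCG@}K$ is the DCG@$K$ of an ideal ranking and $\mathrm{top}_K$ is the set of the $K$ highest-ranked passages. For example, with binary labels and a single relevant passage retrieved at rank 3, $\mathrm{nDCG@}5 = (1/\log_2 4)/(1/\log_2 2) = 0.5$.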

\subsection{Related Work}

\textbf{Conversational retrieval}: Prior work on conversational search~\cite{yu2021few} has explored query rewriting, context modeling, and multi-turn understanding. MTRAGEval adds realistic human-created conversations spanning diverse domains.

\textbf{Hypothetical Document Embeddings}: \citet{gao-etal-2023-precise} introduced HyDE for web search, demonstrating that generating hypothetical answer passages can improve dense retrieval. Our work extends HyDE to multi-turn conversational settings and provides what is, to our knowledge, the first systematic comparison of HyDE's effectiveness on sparse (BM25) versus dense retrieval, showing substantially larger gains on sparse retrieval (+26.7\% vs.\ +4.0\%).

\textbf{Dense retrieval models}: Dense retrieval methods like DPR~\cite{karpukhin-etal-2020-dense} and ColBERT~\cite{khattab2020colbert} have shown strong performance on single-turn retrieval benchmarks. Our comparative analysis of E5-FT (domain-specific, fine-tuned) versus BGE (general-purpose) provides insights about model selection for multi-turn conversational retrieval, suggesting that robustness across query formulations may be more important than domain-specific optimization.

\section{System Overview}

\subsection{Hypothetical Document Embeddings}

HyDE~\cite{gao-etal-2023-precise} bridges the semantic gap between queries and relevant passages by generating a hypothetical answer document, then using its embedding for retrieval. We use Gemini 2.5 Flash for generation with the prompt shown in Figure~\ref{fig:hyde_prompt}.

\begin{figure}
\footnotesize
\begin{verbatim}
You are a helpful assistant who generates
hypothetical answer passages (1–3 sentences).
Given the conversation history and final user
query, write a short, factual paragraph that
directly and concisely answers the final user
query. This passage will be used to find
similar real documents.

Instructions:
- Produce a concise factual passage (1–3
  sentences) that answers the final query.
- Preserve named entities and numeric tokens
  exactly as they appear in the query.
- Do NOT add notes, explanations, or meta-
  comments; output ONLY the hypothetical
  passage.
- Do NOT include any preamble like "Here is
  the answer." Output only the passage text.

Conversation History:
{conversation_history}

Final Query:
{query}
\end{verbatim}
\caption{HyDE prompt for Gemini 2.5 Flash.}
\label{fig:hyde_prompt}
\end{figure}

\subsection{Official Submission System}

Our official submission pipeline consists of three steps: (1) generate a hypothetical answer passage using HyDE, (2) embed the HyDE passage with a fine-tuned dense retriever (E5-FT), and (3) retrieve passages via FAISS nearest neighbor search. We built separate FAISS indices (\texttt{IndexFlatIP}) for each domain using pre-encoded passages.
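
Step (3) can be sketched as follows. The notation is ours and purely illustrative: $f$ denotes the encoder, $h$ the HyDE passage, and $p$ a corpus passage. Assuming the embeddings are L2-normalized before indexing and search, the inner product computed by \texttt{IndexFlatIP} coincides with cosine similarity:
\begin{align*}
\hat{\mathbf{e}}(x) &= \frac{f(x)}{\lVert f(x) \rVert_2}, \\
s(h, p) &= \hat{\mathbf{e}}(h)^{\top} \hat{\mathbf{e}}(p) = \cos\bigl(f(h), f(p)\bigr).
\end{align*}
The $K$ passages with the highest $s(h, p)$ in the relevant domain index are returned.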

\subsection{Development Experiments}

Beyond our official submission, we conducted extensive experiments on our development set to compare: (1) dense retrievers (E5-FT vs BGE), (2) retrieval strategies (dense-only vs hybrid BM25+dense), (3) HyDE application variants (BM25 only, dense only, or both), (4) query formulations (last-turn, questions, rewrite) with optional conversation history, and (5) keyword extraction as an alternative to HyDE.

\section{Experimental Setup}

We evaluate on two sets:
\begin{itemize}
\item \textbf{Official test set}: 507 tasks from the SemEval evaluation phase (official metric: nDCG@5)
\item \textbf{Development set}: 4-domain internal development set (ClapNQ, Cloud, FiQA, Govt) used for comparative analysis and ablation studies (metrics: nDCG@5, nDCG@10, Recall@10)
\end{itemize}
We explored the following configurations.
The official submission settings are marked with $\dagger$.
\begin{itemize}

\item Dense retrievers:
\begin{itemize}

\item $\dagger$\textbf{E5-FT}: Fine-tuned E5-based dense retriever, $\sim$110M parameters; base model \texttt{\small intfloat/e5-base-v2} (BertModel, 768-dim); trained on 170,176 domain-balanced query--passage pairs from the shared task using MultipleNegativesRankingLoss (in-batch negatives, cosine similarity, temperature scale 20); 2 epochs, batch size 16, learning rate $5{\times}10^{-5}$, linear schedule, max sequence length 256 tokens, FP16; \texttt{\small sivarohit2002/qwen06b\_bi-e5-ft-weighted}
\item \textbf{BGE}: $\sim$110M-parameter general-purpose bi-encoder, \texttt{\small BAAI/bge-base-en-v1.5}
\end{itemize}

\item Retrieval strategies:
\begin{itemize}
\item $\dagger$\textbf{Dense-only}: Using dense retriever with Top-K=10 from FAISS inner product similarity (\texttt{\small IndexFlatIP}); all passage and query embeddings are L2-normalized before indexing and search, making inner product equivalent to cosine similarity
\item \textbf{Hybrid}: Combining BM25 (Elasticsearch) with dense retrieval
\item \textbf{HyDE application}: Applied to BM25 only, dense only, or both
\end{itemize}

\item Query formulations: \textbf{last-turn}, \textbf{questions}, $\dagger$\textbf{rewrite} (with optional conversation history; provided by organizers)

\item Query reformulation:
\begin{itemize}
    \item $\dagger$\textbf{HyDE}: Gemini 2.5 Flash, temperature=0.7, max\_tokens=200
    \item \textbf{Keyword}: Keyword extraction using Gemini 2.5 Flash (temperature=0, few-shot prompting), outputting canonical 2--6-word noun phrases and preserving named entities, acronyms, and numbers
\end{itemize}

\item General settings: All results are from single runs executed on an NVIDIA A100 (40\,GB) GPU
\end{itemize}
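
The E5-FT training objective listed above (MultipleNegativesRankingLoss with in-batch negatives, cosine similarity, and scale 20) corresponds, in its standard formulation, to a softmax cross-entropy over in-batch similarities. The following is a sketch in our own notation rather than a transcription of the training code: $B$ is the batch size (16 here), $\mathbf{q}_i$ and $\mathbf{p}_i$ are the embeddings of the $i$-th query--passage pair, and $s = 20$ is the scale.
\begin{align*}
z_{ij} &= s \cdot \cos(\mathbf{q}_i, \mathbf{p}_j), \\
\mathcal{L} &= -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp(z_{ii})}{\sum_{j=1}^{B} \exp(z_{ij})}.
\end{align*}
Each query thus treats the other $B-1$ passages in its batch as negatives, so no explicitly mined hard negatives are required.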

\section{Results}

\subsection{Official Submission Performance}

\begin{table}
\centering
\small
\begin{tabular}{lc}
\toprule
\textbf{System} & \textbf{nDCG@5} \\
\midrule
\textit{Reference systems (from organizers)} \\
Top Performing System & .5776 \\
Top Baseline (ELSER + Rewrite) & .4795 \\
\midrule
\textit{Our Submission} \\
\textbf{clulab-retrieval (E5-FT + HyDE)} & \textbf{.3309} \\
\bottomrule
\end{tabular}
\caption{Official submission results on the test set.}
% Our E5-FT + HyDE dense-only system achieved .3309, compared to the top baseline of .4795 (ELSER + Rewrite).}
\label{tab:official}
\end{table}

Table~\ref{tab:official} shows our official submission results on the test set.
The following sections perform comparative analyses of alternative retrieval configurations on our development set.

\subsection{Model Comparison: E5-FT vs BGE}


\begin{table}[t]
\centering
\small
\setlength{\tabcolsep}{3pt}
\begin{tabular}{lcccccccc}
\toprule
& \multicolumn{4}{c}{\textbf{E5-FT (Dense-only)}} & \multicolumn{4}{c}{\textbf{BGE (Dense-only)}} \\
\cmidrule(lr){2-5} \cmidrule(lr){6-9}
\textbf{Domain} & \multicolumn{2}{c}{Baseline} & \multicolumn{2}{c}{+ HyDE} & \multicolumn{2}{c}{Baseline} & \multicolumn{2}{c}{+ HyDE} \\
\cmidrule(lr){2-3} \cmidrule(lr){4-5} \cmidrule(lr){6-7} \cmidrule(lr){8-9}
%& nDCG@10 & R@10 & nDCG@10 & R@10 & nDCG@10 & R@10 & nDCG@10 & R@10 \\
& G10 & R10 & G10 & R10 & G10 & R10 & G10 & R10 \\
\midrule
ClapNQ & .425 & .515 & .507 & .588 & .492 & .591 & .519 & .623 \\
Cloud  & .191 & .248 & .205 & .267 & .343 & .423 & .369 & .452 \\
FiQA   & .257 & .337 & .293 & .390 & .341 & .418 & .334 & .415 \\
Govt   & .348 & .472 & .372 & .516 & .415 & .518 & .434 & .533 \\
\midrule
\textbf{Macro} & \textbf{.305} & \textbf{.393} & \textbf{.344} & \textbf{.440} & \textbf{.398} & \textbf{.488} & \textbf{.414} & \textbf{.506} \\
%\textit{vs QwenFT baseline} & --- & --- & --- & --- & \textit{+30.5\%} & --- & \textit{+35.7\%} & --- \\
\bottomrule
\end{tabular}
\caption{Comparison of models (E5-FT versus BGE) on the development set. nDCG@10 and R@10 are abbreviated as G10 and R10, respectively.}
%BGE baseline (nDCG@10=.398) outperforms QwenFT+HyDE (.344) by 15.7\%, demonstrating that model choice has substantial impact on retrieval performance. Both models benefit from HyDE (E5-FT: +12.8\%, BGE: +4.0\%). Per-domain BGE results are not available.}
\label{tab:model_comparison}
\end{table}

Table~\ref{tab:model_comparison} compares E5-FT and BGE.
Although both models have $\sim$110M parameters (BERT-base scale), BGE's baseline performance (nDCG@10=.398) exceeds E5-FT's HyDE-augmented performance (nDCG@10=.344) by 15.7\%. Because BGE and E5-FT differ in both base model and training objective, this comparison is observational rather than controlled; nonetheless, it suggests that general-purpose retrieval training may outperform domain-specific fine-tuning even at the same model scale.

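Both models benefit from HyDE in macro-averaged nDCG@10, but by different magnitudes: E5-FT improves by 12.8\% (.305 to .344), while BGE improves by 4.0\% (.398 to .414). The one per-domain exception is BGE on FiQA, where HyDE slightly lowers nDCG@10 (.341 to .334). This pattern suggests that the stronger baseline model gains less from HyDE in relative terms, although its absolute performance remains higher.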
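
The relative changes quoted in this section follow the usual definition, written here in our own notation with $a$ the reference score and $b$ the compared score:
\begin{align*}
\Delta_{\mathrm{rel}}(a \to b) &= \frac{b - a}{a}, \\
\Delta_{\mathrm{rel}}(.305 \to .344) &= \tfrac{.039}{.305} \approx +12.8\%, \\
\Delta_{\mathrm{rel}}(.398 \to .414) &= \tfrac{.016}{.398} \approx +4.0\%, \\
\Delta_{\mathrm{rel}}(.344 \to .398) &= \tfrac{.054}{.344} \approx +15.7\%.
\end{align*}
All inputs are macro nDCG@10 values from Table~\ref{tab:model_comparison}; because they are rounded to three decimals, the percentages are approximate.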


\subsubsection{Model Size vs Domain Adaptation}


\begin{table}
\centering
\small
\setlength{\tabcolsep}{2pt}
\begin{tabular}{lcccc}
\toprule
& \multicolumn{2}{c}{\textbf{FiQA (Finance)}} & \multicolumn{2}{c}{\textbf{Cloud (Technical)}} \\
\cmidrule(lr){2-3} \cmidrule(lr){4-5}
\textbf{Embedding Model} & \textbf{Rewritten} & \textbf{HyDE} & \textbf{Rewritten} & \textbf{HyDE} \\
\midrule
\texttt{e5-ft} & \textbf{.257} & \textbf{.293} & \textbf{.191} & \textbf{.215} \\
\texttt{gte-Qwen2-1.5B} & .162 & .186 & .069 & .114 \\
%\midrule
%\textit{Performance Drop} & \textit{-37\%} & \textit{-36\%} & \textit{-64\%} & \textit{-47\%} \\
\bottomrule
\end{tabular}
\caption{Comparison of model size vs.\ domain adaptation: development set nDCG@10 scores for the domain-specific fine-tuned E5-FT ($\sim$110M parameters) and the general-domain instruction-tuned gte-Qwen2-1.5B ($\sim$1.5B parameters).}
% Despite being 2.5$\times$ larger, the instruction-tuned model significantly underperforms in domain-specific retrieval, validating our choice of task-specific fine-tuning over generic instruction-tuning.}
\label{tab:model_ablation}
\end{table}

Table~\ref{tab:model_ablation} compares E5-FT, which has been fine-tuned on the task domains, with gte-Qwen2-1.5B, a $\sim$13.6$\times$ larger general-domain instruction-tuned model.
Despite its larger size, gte-Qwen2-1.5B substantially underperforms E5-FT on both domains tested, with particularly severe degradation on Cloud ($-$64\% rewritten, $-$47\% HyDE) and a smaller but still large drop on FiQA ($-$37\% rewritten, $-$36\% HyDE). This suggests that domain-adapted fine-tuning can be more effective than instruction-tuning for specialized retrieval tasks, even when the instruction-tuned model has $\sim$13.6$\times$ more parameters. However, the comparison between E5-FT and BGE (Table~\ref{tab:model_comparison}) suggests that general-purpose retrieval training (BGE) can outperform domain-specific fine-tuning when the base model has broader coverage and better robustness.

\subsection{Retrieval Strategy Comparison: Dense-Only vs Hybrid}

\begin{table}
\centering
\small
\setlength{\tabcolsep}{3pt}
\begin{tabular}{lcccc}
\toprule
\textbf{Component} & \multicolumn{2}{c}{\textbf{Baseline}} & \multicolumn{2}{c}{\textbf{+ HyDE}} \\
\cmidrule(lr){2-3} \cmidrule(lr){4-5}
& nDCG@10 & R@10 & nDCG@10 & R@10 \\
\midrule
Sparse (BM25) & .247 & .320 & .313 & .393 \\
%\quad \textit{HyDE improvement} & --- & --- & \textit{+26.7\%} & \textit{+22.8\%} \\
%\midrule
Dense (BGE) & .398 & .488 & .414 & .506 \\
%\quad \textit{HyDE improvement} & --- & --- & \textit{+4.0\%} & \textit{+3.7\%} \\
\bottomrule
\end{tabular}
\caption{Comparison of retrieval strategies on the development set with structured conversation history.}
% BM25 shows large gains from HyDE (+26.7\% nDCG@10), while dense retrieval shows smaller but consistent gains (+4.0\%). Results suggest complementary retrieval signals: BM25 captures lexical matches while dense retrieval handles semantic similarity.}
\label{tab:hybrid}
\end{table}

Table~\ref{tab:hybrid} reports the sparse (BM25) and dense (BGE) components of the hybrid pipeline, each with and without HyDE.
BM25 benefits substantially from HyDE (+26.7\% nDCG@10), while dense retrieval shows smaller gains (+4.0\%). This asymmetry suggests that hypothetical answer passages provide valuable lexical expansion for sparse retrieval, while dense retrievers already capture much of the semantic information that HyDE provides.
We expect the two components to contribute complementary signals in a hybrid system: BM25 excels at exact matches (entities, technical terms, numeric values), while dense retrieval handles paraphrases and conceptual similarity.
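
The rule for fusing BM25 and dense rankings is not detailed above. One common choice, shown here purely as an illustrative assumption and not necessarily the rule used in these experiments, is reciprocal rank fusion (RRF), which combines the two ranked lists without requiring their scores to be on a comparable scale:
\begin{equation*}
\mathrm{RRF}(d) = \sum_{r \in \{\mathrm{BM25},\,\mathrm{dense}\}} \frac{1}{k + \mathrm{rank}_r(d)},
\end{equation*}
where $\mathrm{rank}_r(d)$ is the 1-based rank of passage $d$ under retriever $r$ and $k$ is a smoothing constant ($k = 60$ is a frequently used default). For instance, a passage ranked first by BM25 and third by the dense retriever would receive $1/61 + 1/63 \approx 0.0323$.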

\subsection{HyDE Ablation}

\begin{table}
\centering
\small
\setlength{\tabcolsep}{2pt}
\begin{tabular}{lcccc}
\toprule
\textbf{Configuration} & \textbf{Model} & \textbf{Apply} & \textbf{Baseline} & \textbf{+HyDE} \\
\midrule
\multicolumn{5}{l}{\textit{Dense-only (nDCG@10)}} \\
E5-FT, rewrite & E5-FT & Dense & .305 & .344 \\%(+12.8\%) \\
BGE, rewrite & BGE & Dense & .398 & .414 \\%(+4.0\%) \\
BGE, lastturn & BGE & Dense & .339 & .365 \\%(+7.7\%) \\
\midrule
\multicolumn{5}{l}{\textit{Hybrid (nDCG@10)}} \\
BGE, rewrite & BGE & BM25 only & .247 & .313 \\%(+26.7\%) \\
BGE, rewrite & BGE & Dense only & .398 & .414 \\%(+4.0\%) \\
\bottomrule
\end{tabular}
\caption{Ablation of HyDE across configurations on the development set.}
% HyDE consistently improves retrieval across all settings. Gains vary by model (E5-FT: +12.8\%, BGE: +4.0\%), query type (rewrite vs lastturn), and component (BM25: +26.7\%, dense: +4.0\%).}
\label{tab:hyde_ablation}
\end{table}

Table~\ref{tab:hyde_ablation} compares configurations with and without HyDE.
HyDE improves nDCG@10 in every configuration tested.
Weaker baselines show larger relative improvements (E5-FT: +12.8\% vs.\ BGE: +4.0\%).
BM25 shows the largest gain (+26.7\%), suggesting that HyDE's lexical expansion is particularly valuable for sparse retrieval.
Query formulation affects both baseline and HyDE performance (rewrite $>$ lastturn for both models).
All HyDE results reported here are from single runs. In manual inspection of generated passages, we did not observe obvious instability: for a given query, HyDE consistently produced factually similar hypothetical passages, suggesting limited within-query variance. However, we did not run systematic multi-seed experiments to quantify this variance.

\subsection{Keyword Extraction}

\begin{table*}[t]
\centering
\small
\setlength{\tabcolsep}{2.5pt}
\begin{tabular}{lcccccccc}
\toprule
& \multicolumn{2}{c}{\textbf{ClapNQ}} & \multicolumn{2}{c}{\textbf{Cloud}} & \multicolumn{2}{c}{\textbf{FiQA}} & \multicolumn{2}{c}{\textbf{Govt}} \\
\cmidrule(lr){2-3} \cmidrule(lr){4-5} \cmidrule(lr){6-7} \cmidrule(lr){8-9}
\textbf{Query Method} & \textbf{nDCG@10} & \textbf{R@10} & \textbf{nDCG@10} & \textbf{R@10} & \textbf{nDCG@10} & \textbf{R@10} & \textbf{nDCG@10} & \textbf{R@10} \\
\midrule
\multicolumn{9}{l}{\textit{Baseline: Standard Rewriting}} \\
Rewritten Query & .425 & .515 & .191 & .248 & .257 & .337 & .348 & .472 \\
\midrule
\multicolumn{9}{l}{\textit{Strategy: Hypothetical Document Embeddings (HyDE)}} \\
HyDE (from Last Turn) & .506 & \textbf{.597} & \textbf{.215} & \textbf{.276} & \textbf{.293} & .370 & .336 & .487 \\
HyDE (from Rewritten) & \textbf{.507} & .588 & .205 & .267 & \textbf{.293} & \textbf{.390} & \textbf{.372} & \textbf{.516} \\
\midrule
\multicolumn{9}{l}{\textit{Strategy: Keyword Extraction (Ablation)}} \\
Keywords (Last Turn) & .320 & .410 & .182 & .228 & .176 & .227 & .260 & .365 \\
Keywords (Rewritten) & .379 & .477 & .167 & .218 & .214 & .285 & .302 & .438 \\
Keywords (HyDE Last) & .335 & .415 & .150 & .192 & .202 & .265 & .269 & .392 \\
Keywords (HyDE Rewr) & .375 & .472 & .167 & .210 & .196 & .256 & .277 & .399 \\
\bottomrule
\end{tabular}
\caption{Keyword extraction ablation on development set using E5-FT (nDCG@10 and Recall@10).}
% Keyword-based methods consistently underperform both standard rewriting and HyDE across all domains. Even extracting keywords from HyDE-generated passages degrades performance compared to using full HyDE passages.}
\label{tab:keyword_ablation}
\end{table*}

Table~\ref{tab:keyword_ablation} compares HyDE-style rewriting with an alternative reformulation: keyword extraction (converting queries into canonical 2--6-word noun phrases).
Keyword extraction consistently underperforms both baseline rewriting and HyDE across all domains. Keywords extracted from rewritten queries degrade nDCG@10 by 11--17\% relative to the rewritten baseline (ClapNQ: $-$11\%, Cloud: $-$13\%, FiQA: $-$17\%, Govt: $-$13\%). Even when keywords are extracted from HyDE-generated passages, performance remains substantially below both the rewritten baseline and HyDE approaches.
%This negative result validates our choice of HyDE with full passages over keyword-based approaches and provides three insights: (1) \textbf{Conversational queries benefit from full context}: Dense retrievers can leverage complete query phrasing and context rather than requiring reduction to keywords, (2) \textbf{Lexical precision cannot compensate for semantic loss}: While keywords provide exact term matches, they lose important semantic signals present in natural language queries, and (3) \textbf{HyDE's advantage is passage-level semantics}: Extracting keywords from HyDE passages eliminates the semantic richness that makes HyDE effective, suggesting that HyDE's benefit comes from generating natural passage-like text rather than just expanding vocabulary.
The degradation is largest on FiQA ($-$17\%), with Cloud and Govt close behind ($-$13\% each), suggesting that queries in specialized domains especially require full context and terminology that keywords alone cannot capture.
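
The per-domain figures for keywords extracted from rewritten queries can be reproduced from Table~\ref{tab:keyword_ablation} as the relative change in nDCG@10 against the rewritten-query baseline (the table values are rounded, so the percentages are approximate):
\begin{align*}
\text{ClapNQ:}\quad & (.379 - .425)/.425 \approx -10.8\%, \\
\text{Cloud:}\quad & (.167 - .191)/.191 \approx -12.6\%, \\
\text{FiQA:}\quad & (.214 - .257)/.257 \approx -16.7\%, \\
\text{Govt:}\quad & (.302 - .348)/.348 \approx -13.2\%.
\end{align*}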

\subsection{Discussion}

\subsubsection{General-Purpose Training vs.\ Domain-Specific Fine-Tuning}

Our comparison of E5-FT ($\sim$110M parameters, domain-specific fine-tuning) and BGE ($\sim$110M parameters, general-purpose) shows that BGE outperforms E5-FT by 30.5\% on baseline retrieval. Even when E5-FT uses HyDE, BGE without HyDE still exceeds it by 15.7\%, and BGE+HyDE outperforms E5-FT+HyDE by 20.3\%. Because these two models differ in both base model and training objective, the comparison is observational rather than a controlled ablation of training approach.

We hypothesize that BGE's training on diverse retrieval tasks provides better robustness to varied query formulations and domain-specific language. E5-FT's domain-specific fine-tuning may be overfitting particular query patterns, reducing performance on queries outside the fine-tuning distribution. Confirming this hypothesis would require fine-tuning from the same base model with and without domain-specific data.

\subsubsection{Complementary Retrieval Signals}

The large gap between BM25's HyDE improvement (+26.7\%) and dense retrieval's improvement (+4.0\%) highlights the complementary nature of sparse and dense retrieval. BM25 benefits from HyDE's lexical expansion, which converts terse queries into passage-like text with richer vocabulary. Dense retrieval, already operating in semantic space, sees smaller gains because it can bridge vocabulary gaps without explicit lexical matching.

This complementarity suggests that hybrid retrieval strategies may be particularly valuable for conversational search, where queries vary widely in formulation (from single words to complete sentences) and context requirements.

\subsubsection{Query Formulation Sensitivity}

Our results show that both baseline performance and HyDE improvements vary substantially with query formulation. Rewritten queries (standalone, context-incorporated) outperform lastturn queries (contextless, potentially ambiguous) by 17\% for the BGE baseline (.398 vs.\ .339 nDCG@10; Table~\ref{tab:hyde_ablation}) and by 35\% for the E5-FT baseline.

Interestingly, BGE shows more consistent performance across query types than E5-FT, suggesting better robustness to query formulation variance. This robustness is consistent with our hypothesis that general-purpose retrieval training may yield more stable representations across varied query formulations than domain-specific fine-tuning.

\subsubsection{Error Analysis}

Manual inspection of 50 challenging queries reveals common failure patterns:

\textbf{HyDE hallucination}: HyDE occasionally generates specific details not present in queries (e.g., inventing specific company names or dates when the query asks about ``a company'' or ``recent events''). These hallucinated details can lead retrieval toward irrelevant but lexically matching passages.

\textbf{Topic ambiguity}: When queries admit multiple interpretations (e.g., ``What about security?'' could mean cybersecurity, physical security, or financial security), HyDE must commit to one interpretation, potentially missing relevant passages addressing alternative interpretations.

\textbf{Query formulation sensitivity}: E5-FT shows larger performance variance across query types than BGE. For example, concatenated question strings (the \textbf{questions} variant) cause significant performance degradation for E5-FT, while BGE remains relatively stable.

\subsubsection{Reflections on Official Submission}

Our official submission's performance (nDCG@5 of .3309) can be attributed to several factors identified through our comparative analysis:

\begin{enumerate}
\item \textbf{Suboptimal model selection}: BGE's baseline performance (dev nDCG@10 of .398) is 30.5\% above E5-FT's (.305)
\item \textbf{Dense-only approach}: Our submission did not leverage BM25's complementary signal, which shows +26.7\% improvement from HyDE
\item \textbf{Limited robustness}: E5-FT's sensitivity to query formulation may have hurt performance on varied test set queries
\end{enumerate}
Although we could not evaluate BGE or hybrid retrieval on the official test set because of time constraints and data preprocessing differences, our development set analysis suggests that these alternative configurations merit investigation in future work.

\section{Conclusion}

We presented a comparative analysis of dense retrievers and retrieval strategies for multi-turn conversational retrieval in SemEval-2026 Task 8. Our official submission (E5-FT + HyDE, dense-only) achieved nDCG@5=.3309.

Through extensive development set experiments we found:

% \paragraph{Model robustness matters:} BGE (110M parameters, general-purpose) outperforms QwenFT (600M parameters, domain-specific) by 30.5\% on baseline retrieval, suggesting that robustness across query formulations is more valuable than model size or domain-specific fine-tuning

\paragraph{BGE outperforms E5-FT, suggesting general-purpose training may matter more than domain-specific fine-tuning:} BGE ($\sim$110M parameters, general-purpose) outperforms E5-FT ($\sim$110M parameters, domain-specific fine-tuned) by 30.5\% on baseline retrieval. Because the two models differ in both base model and training objective, this is an observational finding; we hypothesize that breadth of retrieval training contributes to BGE's stronger and more robust performance.

\paragraph{Hybrid retrieval provides complementary signals:} BM25 and dense retrieval show asymmetric responses to HyDE (+26.7\% vs.\ +4.0\%), suggesting that they capture different aspects of relevance.

\paragraph{HyDE consistently helps:} Across all configurations tested, HyDE provides improvements ranging from +4.0\% to +26.7\%, with particularly strong gains on sparse retrieval.

\paragraph{Keyword simplification hurts:} Reducing queries to keyword phrases degrades performance by 11--28\% across domains, suggesting that dense retrievers benefit from preserving full semantic context rather than lexical precision alone.

\medskip
These findings suggest future research directions: investigating model robustness metrics beyond single-turn benchmarks, exploring optimal fusion strategies for hybrid retrieval, and developing HyDE variants that balance lexical expansion with semantic precision.

\newpage
\section*{Acknowledgments}

We thank the SemEval-2026 Task 8 organizers for creating the MTRAGEval benchmark and providing comprehensive evaluation infrastructure.

%\bibliographystyle{acl_natbib}
\bibliography{anthology-1,references}

\end{document}
