\documentclass[11pt]{article}

% Change "review" to "final" to generate the final (sometimes called camera-ready) version.
% Change to "preprint" to generate a non-anonymous version with page numbers.
\usepackage[final]{acl}
% Standard package includes
\usepackage{times}
\usepackage{latexsym}
\usepackage{tabularx}
\usepackage{amsmath}
\usepackage{listings}
\usepackage{xcolor}

\lstset{
  basicstyle=\ttfamily\footnotesize,
  breaklines=true,
  breakatwhitespace=true,
  columns=fullflexible,
  showstringspaces=false,
  frame=single
}
% For proper rendering and hyphenation of words containing Latin characters (including in bib files)
\usepackage[T1]{fontenc}
% For Vietnamese characters
% \usepackage[T5]{fontenc}
% See https://www.latex-project.org/help/documentation/encguide.pdf for other character sets

% This assumes your files are encoded as UTF8
\usepackage[utf8]{inputenc}

% This is not strictly necessary, and may be commented out,
% but it will improve the layout of the manuscript,
% and will typically save some space.
\usepackage{microtype}

% This is also not strictly necessary, and may be commented out.
% However, it will improve the aesthetics of text in
% the typewriter font.
\usepackage{inconsolata}

%Including images in your LaTeX document requires adding
%additional package(s)
\usepackage{graphicx}

% If the title and author information does not fit in the area allocated, uncomment the following
%
%\setlength\titlebox{<dim>}
%
% and set <dim> to something 5cm or larger.

\title{CuriosAI at SemEval-2026 Task 8: Hybrid retrieval system with repeated sampling for generation}

% Author information can be set in various styles:
% For several authors from the same institution:
% \author{Author 1 \and ... \and Author n \\
%         Address line \\ ... \\ Address line}
% if the names do not fit well on one line use
%         Author 1 \\ {\bf Author 2} \\ ... \\ {\bf Author n} \\
% For authors from different institutions:
% \author{Author 1 \\ Address line \\  ... \\ Address line
%         \And  ... \And
%         Author n \\ Address line \\ ... \\ Address line}
% To start a separate ``row'' of authors use \AND, as in
% \author{Author 1 \\ Address line \\  ... \\ Address line
%         \AND
%         Author 2 \\ Address line \\ ... \\ Address line \And
%         Author 3 \\ Address line \\ ... \\ Address line}

% \author{First Author \\
%   Affiliation / Address line 1 \\
%   Affiliation / Address line 2 \\
%   Affiliation / Address line 3 \\
%   \texttt{email@domain} \\\And
%   Second Author \\
%   Affiliation / Address line 1 \\
%   Affiliation / Address line 2 \\
%   Affiliation / Address line 3 \\
%   \texttt{email@domain} \\}

\author{
 \textbf{Aiswariya Manoj Kumar\textsuperscript{1}},
 \textbf{Hiroki Takushima\textsuperscript{1}},
 \textbf{Fumika Beppu\textsuperscript{1}},
\\
 \textbf{Yuki Shibata\textsuperscript{1}},
 \textbf{Daichi Yamaga\textsuperscript{1}},
  \textbf{Takayuki Hori\textsuperscript{1}},
%  \textbf{Seventh Author\textsuperscript{1}},
%  \textbf{Eighth Author \textsuperscript{1,2,3,4}},
%\\
%  \textbf{Ninth Author\textsuperscript{1}},
%  \textbf{Tenth Author\textsuperscript{1}},
%  \textbf{Eleventh E. Author\textsuperscript{1,2,3,4,5}},
%  \textbf{Twelfth Author\textsuperscript{1}},
%\\
%  \textbf{Thirteenth Author\textsuperscript{3}},
%  \textbf{Fourteenth F. Author\textsuperscript{2,4}},
%  \textbf{Fifteenth Author\textsuperscript{1}},
%  \textbf{Sixteenth Author\textsuperscript{1}},
%\\
%  \textbf{Seventeenth S. Author\textsuperscript{4,5}},
%  \textbf{Eighteenth Author\textsuperscript{3,4}},
%  \textbf{Nineteenth N. Author\textsuperscript{2,5}},
%  \textbf{Twentieth Author\textsuperscript{1}}
\\
\\
 \textsuperscript{1}SoftBank Corp.,
%  \textsuperscript{2}Affiliation 2,
%  \textsuperscript{3}Affiliation 3,
%  \textsuperscript{4}Affiliation 4,
%  \textsuperscript{5}Affiliation 5
\\
 \small{
   \textbf{Correspondence:} \href{mailto:aiswariya.manojkumar@g.softbank.co.jp}{aiswariya.manojkumar@g.softbank.co.jp} 
 }
}

\begin{document}
\maketitle
\begin{abstract}
SemEval-2026 Task 8 (MTRAGEval) evaluates multi-turn Retrieval-Augmented Generation (RAG) under conversational challenges such as non-standalone turns, underspecification, and answerability detection. These conditions amplify retrieval and generation errors that standard single-turn RAG pipelines fail to address effectively. We present a robustness-oriented multi-turn RAG system combining contextual query rewriting, heterogeneous hybrid retrieval fused with Reciprocal Rank Fusion (RRF), domain-adaptive Low-Rank Adaptation (LoRA) reranking, and repeated sampling with metric-guided selection. On the official test set, our approach outperforms the organizers’ baselines across all subtasks: Retrieval (nDCG@5: 0.5396 vs. 0.4795), Generation (0.7571 vs. 0.6390), and RAG (0.5486 vs. 0.5366). Our system ranks 5th in Subtask A, 5th in Subtask B, and 7th in Subtask C on the official leaderboard. These results demonstrate that calibrated hybrid retrieval combined with robust generation selection is effective for multi-turn RAG.

\end{abstract}

\section{Introduction}
Multi-turn Retrieval-Augmented Generation (RAG) must interpret dialogue context, resolve implicit references, and decide when available evidence is insufficient. MTRAGEval \citep{Katsis2025MTRAG, MTRAGEvalProposal2026} targets these behaviors through non-standalone and underspecified queries, as well as answerability-sensitive turns. In such settings, small upstream failures (e.g., imperfect rewriting or retrieval noise) can cascade into ungrounded or mismatched responses, making pipeline robustness central.

Our system targets these failure modes with four components: (i) contextual query rewriting to produce retrieval-ready standalone queries \citep{Zhou2023UnifiedCQR,Sun2023ImprovingCQR}; (ii) heterogeneous hybrid retrieval combining sparse and dense retrievers and fusing candidates via Reciprocal Rank Fusion (RRF) \citep{Cormack2009RRF}; (iii) domain-adaptive reranking using a Low-Rank Adaptation (LoRA) \citep{Hu2022LoRA} fine-tuned Qwen3-Reranker-8B \citep{Zhang2025Qwen3Embedding}; and (iv) repeated sampling with metric-guided selection to reduce generation variance under noisy evidence.

On the official evaluation, we achieve 0.5396 nDCG@5 for Retrieval, 0.7571 for Generation, and 0.5486 for RAG, improving over the official baselines across all subtasks. These results indicate that hybrid retrieval substantially improves coverage in multi-turn settings, while repeated generation with metric-aligned selection enhances stability under noisy retrieval. Remaining challenges include handling deeply underspecified queries and multi-hop reasoning across distant passages.

\section{Background}

\subsection{Task Setup}
MTRAGEval evaluates multi-turn RAG systems under conversational settings that resemble real-world information-seeking dialogues. Each instance consists of the full dialogue history up to the current turn plus the latest user query. Systems must use conversational context for reference resolution and answerability decisions \citep{MTRAGEvalProposal2026}. The benchmark includes three subtasks:

\begin{itemize}
    \item \textbf{Subtask A (Retrieval):} Retrieve relevant passages for the current turn.
    \item \textbf{Subtask B (Generation):} Generate an answer for the current turn using reference passages provided by the organizers.
    \item \textbf{Subtask C (RAG):} Perform end-to-end retrieval for the current turn, followed by grounded generation.
\end{itemize}

Table~\ref{tab:subtasks} summarizes the input and output requirements for each subtask.

\begin{table*}[t]
\centering
\small
\begin{tabular}{l p{5cm} p{7cm}}
\hline
\textbf{Subtask} & \textbf{Input} & \textbf{Output} \\
\hline
Subtask A & Full conversation & Top-10 ranked passages \\
Subtask B & Full conversation + reference passages & Grounded answer to the last turn of conversation \\
Subtask C & Full conversation & Grounded answer to the last turn of conversation \\
\hline
\end{tabular}
\caption{Summary of inputs and outputs for each MTRAGEval subtask.}
\label{tab:subtasks}
\end{table*}

\subsection{Dataset}
MTRAGEval builds upon the MTRAG benchmark \citep{Katsis2025MTRAG} and extends it with evaluation tasks targeting challenging conversational properties. The development data consists of 110 manually created and reviewed English conversations, comprising 842 tasks across four domains: ClapNQ, FiQA, Govt, and Cloud. 
% \begin{itemize}
%     \item CLAPNQ (Wikipedia-based QA)
%     \item FiQA (financial advice from StackExchange)
%     \item Govt (web-crawled content from selected .gov and .mil domains)
%     \item Cloud (technical documentation from cloud services)
% \end{itemize}
We provide additional corpus analysis in Appendix~\ref{sec:eda}.

For the test phase, the organizers provided 507 tasks derived from unseen dialogue contexts \citep{MTRAGEvalProposal2026}. These test tasks (MTRAG-UN) \citep{Katsis2026MTRAGUN} contain a higher proportion of non-standalone, underspecified, and answerability-sensitive instances than the earlier MTRAG benchmark.

\subsection{Related Work}

Contextual query rewriting has been shown to improve retrieval quality for multi-turn dialogue systems by expanding non-standalone user queries into standalone forms \citep{Zhou2023UnifiedCQR, Sun2023ImprovingCQR}. Hybrid retrieval methods that combine sparse and dense rankings, for example via RRF, have demonstrated robustness across heterogeneous corpora \citep{Cormack2009RRF}. Adaptation techniques like LoRA enable efficient fine-tuning of large reranking models \citep{Hu2022LoRA}, while repeated sampling strategies have been explored to enhance generation reliability \citep{Wang2023SelfConsistency}. Our work builds on these foundations and focuses on robustness across stages in multi-turn conversational RAG.

\section{System Overview}

Our system is designed to mitigate error propagation in multi-turn RAG by explicitly addressing conversational ambiguity, retrieval instability, and generation variance. These are handled through conversational rewriting, hybrid retrieval with domain-adaptive reranking, and metric-guided generation. Figures~\ref{fig:retrieval_pipeline} and~\ref{fig:generation_pipeline} illustrate the retrieval and generation pipelines, respectively.

\subsection{Conversational Query Rewriting}
We adopt the contextual query rewriting strategy provided in the official baseline \citep{Katsis2025MTRAG}. Given the full conversation history and the current user query, GPT-5 \citep{AzureOpenAI2025} rewrites the latest turn into a standalone query that preserves user intent while resolving implicit references.

The rewriting prompt follows the baseline formulation, and differs only in the underlying language model used for generation. This approach is consistent with prior contextual query rewriting work \citep{Zhou2023UnifiedCQR, Sun2023ImprovingCQR}.

\subsection{Hybrid Retrieval Framework}
\begin{figure*}[t]
\centering
\includegraphics[width=0.8\textwidth]{latex/mtrag_r_task_updated.pdf}
\caption{Hybrid retrieval pipeline. The conversational query is first rewritten into a standalone form. Multiple sparse and dense retrievers independently retrieve top candidates, which are fused using RRF and reranked via LoRA-fine-tuned Qwen3-Reranker-8B. The final top-5 passages are passed to the generator.}
\label{fig:retrieval_pipeline}
\end{figure*}

\subsubsection{Preprocessing and Indexing}
We use the organizers' passage-level corpora and retain the original chunking. Before indexing, passages are normalized with NFKC and cleaned to remove control characters and common crawl artifacts (HTML remnants, pagination stubs, and boilerplate).

\subsubsection{Summary augmentation}
For each passage chunk, we generated a concise summary using Qwen3-VL-32B-Instruct \citep{bai2025qwen3vltechnicalreport} and appended it as an additional field (\texttt{summary}). The original text content remained unchanged. This augmentation enriched dense semantic representations while preserving passage granularity.

\subsubsection{Multi-Index Retrieval}

We constructed four independent retrieval indices over the cleaned corpora with appended summaries.

\begin{itemize}
    \item Sparse lexical retrieval: SPLADE-v3 \citep{Lassance2024SPLADEv3}
    \item Dense embedding: NV-Embed-v2 \citep{Lee2024NVEmbed}
    \item Dense embedding: Qwen3-Embedding-8B \citep{Zhang2025Qwen3Embedding}
    \item Dense embedding: text-embedding-3-large \citep{OpenAI2024TextEmbedding3}
\end{itemize}

Each retriever returns the top-100 passages for the rewritten query. The diversity across sparse and dense representations improves coverage within domain-specific corpora that contain varying lexical and structural characteristics.

\subsubsection{Reranking}
We fine-tune Qwen3-Reranker-8B with LoRA \citep{Hu2022LoRA} and rerank retrieved candidates. Development results show consistent gains in Cloud but limited or negative impact in ClapNQ, FiQA, and Govt (Table~\ref{tab:domain_rerank}). We therefore adopt a domain-adaptive strategy: we apply reranking only for Cloud and retain the original rankings elsewhere.
Additional fine-tuning details are provided in Appendix~\ref{sec:hypa_reranker}.

\subsubsection{Weighted Reciprocal Rank Fusion}

The reranked lists from all retrievers are combined using a weighted variant of RRF \citep{Cormack2009RRF}. For document $d$, the fused score is computed as:

\[
RRF(d) = \sum_{r \in R} w_r \cdot \frac{1}{k + \text{rank}_r(d)}
\]
where $R$ denotes the set of reranked retrieval systems, $w_r$ is the weight assigned to system $r$, and $k$ is a smoothing parameter.

Weights and the smoothing parameter $k$ were selected via grid search on the development split (details in Appendix~\ref{sec:hypa_retrieval}). The final top-10 passages after fusion are provided to the generator.
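
As an illustration of the fusion rule, consider a minimal worked example with hypothetical values that are not our tuned settings: two retrievers with weights $w_1 = 1.0$ and $w_2 = 0.5$, and $k = 10$. Suppose passage $d$ is ranked 1st by retriever 1 and 3rd by retriever 2, while passage $d'$ is ranked 2nd and 1st, respectively. For simplicity, the example assumes that both passages appear in both lists.
\[
\begin{aligned}
RRF(d) &= \frac{1.0}{10 + 1} + \frac{0.5}{10 + 3} \approx 0.1294, \\
RRF(d') &= \frac{1.0}{10 + 2} + \frac{0.5}{10 + 1} \approx 0.1288.
\end{aligned}
\]
Here $d$ narrowly outranks $d'$ because the higher-weighted retriever prefers it. With equal weights ($w_1 = w_2 = 1.0$), the order would flip ($\approx 0.1678$ vs.\ $\approx 0.1742$), which shows how the weights shift the influence of individual retrievers.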

\subsection{Generation Strategy}

\begin{figure}[t]
\centering
\includegraphics[width=\columnwidth]{latex/mtrag_ag_task_updated.pdf}
\caption{Generation pipeline with repeated sampling and dual-score selection. Five candidate responses are generated and scored using metric-aligned evaluation signals. The final output is the candidate with the highest harmonic-mean score.}
\label{fig:generation_pipeline}
\end{figure}

\subsubsection{Prompting}

Generation uses GPT-5 with a structured grounding prompt enforcing: (i) reliance solely on retrieved passages for factual content, (ii) explicit handling of fully, partially, and unanswerable queries, (iii) message-type-aware response formatting, and (iv) answering only the latest turn. This aligns generation behavior with official faithfulness and answerability criteria.

\subsubsection{Repeated Sampling and Metric-Guided Selection}
To mitigate generation variance under imperfect retrieval, we generate $n=5$ independent candidate responses $\{y_1, \dots, y_n\}$. This approach increases the likelihood that at least one candidate is both well-grounded and contextually appropriate.
Each candidate is scored using:
\begin{itemize}
\item \textbf{Faithfulness Score} $R_f$, computed using Retrieval-Augmented Generation Assessment Score (RAGAS) \citep{Es2023RAGAS}, measuring grounding with respect to the retrieved passages.
\item \textbf{LLM-as-a-Judge Score} $R_{llm}$ \citep{Zheng2023JudgingLLMs}, computed using a GPT-based evaluator with a simplified prompt inspired by the official evaluation setup,\footnote{\url{https://github.com/IBM/mt-rag-benchmark/blob/main/scripts/evaluation/judge_utils.py}} focusing on context relevance and hallucination penalization. This internal judge was used only for candidate selection and was not identical to the organizers' official $RB_{llm}$ scorer, since it used an independently written prompt without reference-answer conditioning. The complete prompt is provided in Appendix~\ref{sec:eval_prompt} for transparency.
\end{itemize}
The final output is selected as:
\[
y^* = \arg\max_{y_i} \; \text{HM}\big(R_f(y_i), R_{llm}(y_i)\big)
\]
where $\text{HM}$ denotes the harmonic mean. This formulation favors responses that are simultaneously well-grounded and conversationally appropriate, discouraging candidates that score highly on only one criterion.
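
To make the selection rule concrete, the harmonic mean of two scores $a, b > 0$ is
\[
\text{HM}(a, b) = \frac{2ab}{a + b}.
\]
As a purely illustrative example with made-up scores, a candidate with $(R_f, R_{llm}) = (0.9, 0.5)$ obtains $\text{HM} \approx 0.643$, whereas a candidate with $(0.7, 0.7)$ obtains $\text{HM} = 0.7$. Both candidates have the same arithmetic mean ($0.7$), but the balanced candidate is selected.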

% Unlike majority-vote self-consistency decoding \citep{Wang2023SelfConsistency}, which aggregates reasoning trajectories, our approach performs metric-guided candidate selection aligned with evaluation criteria.

\section{Experimental Setup}

\subsection{Data Splits}
We tune all hyperparameters on the official development split only and report final scores from organizer evaluation on the held-out test set.

\subsection{Hyperparameter Tuning}
Retrieval hyperparameters for weighted RRF were tuned via grid search on the development split, optimizing retriever weights and smoothing parameter $k$ with respect to the official evaluation metric nDCG@5. 

For generation, five candidate responses were sampled per query. The decoding temperature was not tuned due to API constraints. Final answer selection used the metric-guided harmonic-mean score.

Detailed search spaces and training configurations are provided in Appendix~\ref{sec:hypa_retrieval}.

\section{Results}
Table~\ref{tab:official_results} reports official test-set performance on all three subtasks, along with our leaderboard rankings. The organizers’ baseline follows the MTRAG-UN pipeline \citep{Katsis2026MTRAGUN}: (i) contextual query rewriting with gpt-oss-20b, (ii) ELSER-based sparse retrieval for Subtask A, and (iii) structured grounded generation with answerability logic using gpt-oss-120b (Subtask B) and qwen-30b-a3b-thinking (Subtask C). This baseline constitutes a strong multi-turn RAG system under the benchmark’s conversational phenomena.

Beyond the organizers’ baseline, leaderboard rankings provide broader context, with our system placing \textbf{5th} in Subtask A, \textbf{5th} in Subtask B, and \textbf{7th} in Subtask C, demonstrating competitive performance among participating systems.

Our system improves over the baseline across all subtasks. On \textbf{Subtask A}, the gain of +0.0601 nDCG@5 suggests that heterogeneous hybrid retrieval with weighted RRF substantially improves evidence coverage in multi-domain, later-turn settings. The largest absolute gain is on \textbf{Subtask B} (+0.1181), with a smaller gain on \textbf{Subtask C} (+0.0120), indicating that repeated sampling with metric-guided selection improves generation quality both when passages are fixed (Subtask B) and when retrieval noise is present (Subtask C), though the end-to-end setting remains bottlenecked by retrieval imperfections.

\begin{table}[t]
\centering
\small
\begin{tabular}{l c c c}
\hline
\textbf{Subtask} & \textbf{Our Score} & \textbf{Baseline} & \textbf{Rank (Ours/Total)} \\
\hline
Subtask A & 0.5396 & 0.4795 & 5/38 \\
Subtask B & 0.7571 & 0.6390 & 5/26 \\
Subtask C & 0.5486 & 0.5366 & 7/29\\
\hline
\end{tabular}
\caption{Official test results compared against the organizers’ baseline. Rank indicates the position of our system on the official leaderboard for each subtask. Subtask A is evaluated using nDCG@5, while Subtasks B and C use the harmonic mean of $RB_{alg}$, $RL_F$, and $RB_{llm}$. See Appendix~\ref{sec:eval_metrics} for details.}
\label{tab:official_results}
\end{table}
\paragraph{Retrieval results}
Hybrid retrieval yields a substantial improvement over the baseline sparse-only configuration. In multi-turn settings, rewriting errors and domain heterogeneity often reduce lexical overlap between query and evidence; combining sparse and multiple dense retrievers mitigates this mismatch by improving recall across paraphrases and technical variants. Weighted RRF further stabilizes rankings by leveraging complementary signals from individual retrievers.

\paragraph{Generation results}
Our generation improvements are more pronounced in Subtask B than in Subtask C. Because Subtask B provides reference passages, gains primarily reflect improvements in generation robustness, specifically grounding behavior, answerability handling, and response appropriateness. In contrast, Subtask C remains constrained by upstream retrieval quality, limiting the magnitude of end-to-end gains.

The consistent improvement in Subtask B suggests that repeated sampling and metric-guided selection improve stability in conversational settings, particularly for underspecified or answerability-sensitive turns. In Subtask C, improvements are more modest but indicate that the generation strategy remains robust even under imperfect retrieval.

\subsection{Ablation Study of Retrieval Refinements}

\begin{table}[t]
\centering
\small
\begin{tabular}{l c}
\hline
\textbf{Configuration} & \textbf{Average nDCG@5} \\
\hline
Raw corpus & 0.3293 \\
+ Cleaning & 0.3332 \\
+ Summary augmentation & 0.3425 \\
+ Query prefixing & 0.3921 \\
+ Reranking & 0.424 \\
+ Fine-tuned reranking & \textbf{0.443} \\
\hline
\end{tabular}
\caption{Incremental retrieval improvements on the development split using the Qwen3-Embedding-8B index. Each row cumulatively adds the listed modification.}
\label{tab:retrieval_ablation}
\end{table}
We quantify the impact of retrieval refinements on the development split by progressively adding: (i) corpus cleaning, (ii) passage-level summary augmentation, (iii) model-specific query prefixing for embedding retrievers, (iv) reranking, and (v) LoRA fine-tuning of the reranker.

Table~\ref{tab:retrieval_ablation} reports development nDCG@5 under these incremental modifications. Every step improves over the previous one, from 0.3293 on the raw corpus to 0.443 with fine-tuned reranking. Corpus cleaning (+0.0039) reduces lexical noise and improves stability across domains, and summary augmentation (+0.0093) strengthens dense semantic representations. Query prefixing yields the largest single gain (+0.0496) by improving embedding alignment for instruction-tuned models. Reranking adds +0.0319, and fine-tuning the reranker adds a further +0.019 on average, although its benefit is limited to certain domains.

\subsection{Domain Effects of Reranking}
As shown in Table~\ref{tab:domain_rerank}, reranking improves nDCG@5 in Cloud (+0.0233) but degrades it in FiQA ($-0.0336$) and Govt ($-0.0322$), with a negligible effect on ClapNQ ($-0.0003$). This motivates our domain-adaptive strategy of applying reranking only for Cloud.

We attribute this behavior to two factors. First, weighted RRF already produces strong first-stage rankings by integrating sparse and dense signals, leaving limited headroom for reranking in domains where lexical and semantic alignment is already high. Second, corpus characteristics differ substantially across domains (Appendix~\ref{sec:eda}). Cloud contains longer, more repetitive technical documentation (lower lexical diversity), where semantic reranking can better separate structurally similar passages. In contrast, FiQA and ClapNQ contain shorter, more lexically diverse passages, and Govt contains substantial noise; in these cases, reranking can amplify spurious semantic matches or reduce robustness to boilerplate.

\begin{table}[t]
\centering
\small
\begin{tabular}{l c c}
\hline
\textbf{Domain} & \textbf{Pre-reranked RRF} & \textbf{Post-reranked RRF} \\
\hline
FiQA & 0.4527 & 0.4191 \\
Govt & 0.5114 & 0.4792 \\
ClapNQ & 0.5420 & 0.5417 \\
\textbf{Cloud} & \textbf{0.4188} & \textbf{0.4421} \\
\hline
\end{tabular}
\caption{Development-split nDCG@5 of RRF over the original ranked lists vs.\ the reranked lists.}
\label{tab:domain_rerank}
\end{table}

\subsection{Error Analysis}

Manual inspection reveals two recurring issues. First, underspecified queries that implicitly required clarification were typically answered using available evidence rather than prompting for clarification. Second, generated responses were often more verbose than ground-truth references, which may negatively affect $RB_{alg}$ despite factual correctness. Introducing explicit length control could improve alignment with reference-style answers.

Most errors were attributable to retrieval noise rather than clear hallucination, though we did not conduct a dedicated hallucination audit.

\section{Conclusion}
We presented a robustness-oriented multi-turn RAG system for SemEval-2026 Task 8 (MTRAGEval) integrating conversational query rewriting, hybrid retrieval with weighted RRF, domain-adaptive reranking, and metric-guided repeated generation. The system consistently outperforms official baselines across all subtasks, highlighting the importance of mitigating error propagation in conversational pipelines.

Future work includes feedback-driven retrieval refinement, task-specific generation fine-tuning, and improved structured retrieval strategies.


% Bibliography entries for the entire Anthology, followed by custom entries
%\bibliography{anthology,custom}
% Custom bibliography entries only
\bibliography{custom}

\appendix
% \section{Example Appendix}
% \label{sec:appendix}
\section{Corpus Exploratory Data Analysis}
\label{sec:eda}

We conducted exploratory data analysis on the passage-level corpora provided by the organizers to better understand domain-specific characteristics and potential retrieval challenges.

\subsection{Qualitative analysis}

\subsubsection{General Observations}

All corpora are provided at the passage level, with relevance judgments defined over passage identifiers. Each passage contains metadata fields such as title and/or URL depending on the domain. Titles, when present, are prefixed to the passage text.

\subsubsection{ClapNQ (Wikipedia-based QA)}

The ClapNQ corpus contains titles but no valid URLs. Content is largely educational and scientific, covering topics such as normal distributions, circular motion, Planck’s law, Lagrangian mechanics, radioactive decay, and electromagnetic radiation. Many passages include mathematical expressions, Unicode characters (e.g., Greek symbols), and multilingual text fragments (e.g., Japanese, Arabic, Hindi, Hebrew, Cherokee scripts). Some passages contain flattened tables (e.g., racing results) and formula-heavy text, which may affect lexical matching.

\subsubsection{FiQA (financial advice from StackExchange)}

The FiQA corpus does not include URLs or titles and consists of relatively short passages derived from financial discussion forums. The language is conversational and informal, frequently containing slang, profanity (e.g., abbreviated or partially masked terms), first-person perspectives, and opinionated statements. We observed occasional empty passages and sparse mathematical expressions. The informal style introduces variability that may impact both sparse and dense retrieval behavior.

\subsubsection{Cloud (technical documentation from cloud services)}

The Cloud corpus includes valid URLs but no titles. Content primarily consists of cloud CLI documentation and technical instructions. Passages contain numerous newline characters, hyphenated command flags, unstructured lists, and in some cases numeric-only segments likely derived from chart data. Repeated artifacts such as SVG icon URLs and template markers were observed, necessitating targeted cleaning.

\subsubsection{Govt (web-crawled content from selected .gov and .mil domains)}

The Govt corpus contains both valid URLs and titles and is the noisiest among the four domains. Content includes structured court records, privacy policies, and administrative documents. We observed extensive HTML remnants, pagination markers (e.g., repeated page indicators), long web archive URLs, and multilingual Unicode text (including Korean, Hebrew, Vietnamese, Urdu, and Russian scripts). These artifacts increase lexical noise and motivated more aggressive preprocessing.

\subsubsection{Implications for Retrieval}

The four domains exhibit substantial heterogeneity in writing style, structure, and noise patterns. ClapNQ is formula-heavy and multilingual, FiQA is informal and conversational, Cloud is technical and semi-structured, and Govt contains significant web crawl artifacts. These differences motivated our hybrid retrieval design combining sparse and dense representations, domain-adaptive reranking, and preprocessing steps to mitigate noise.

\subsection{Quantitative analysis}
\begin{figure}[t]
\centering
\includegraphics[width=\columnwidth]{latex/appendix_ttr_annotated.png}
\caption{Lexical variability (TTR) across domains. Cloud and Govt exhibit lower diversity than ClapNQ and FiQA.}
\label{fig:ttr}
\end{figure}

\begin{figure}[t]
\centering
\includegraphics[width=\columnwidth]{latex/appendix_wordcount_annotated.png}
\caption{Chunk length (word count) statistics across domains. Cloud and Govt passages are approximately twice as long as ClapNQ, reflecting longer technical and administrative document structure.}
\label{fig:wordcount}
\end{figure}
We measure lexical diversity using the Type–Token Ratio (TTR), defined as the ratio of unique tokens to total tokens in a passage.
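Concretely, for a passage $p$ with token sequence $t_1, \dots, t_N$,
\[
\text{TTR}(p) = \frac{|\{t_1, \dots, t_N\}|}{N},
\]
so that, as a hypothetical example, a 100-token passage containing 43 distinct tokens has $\text{TTR} = 0.43$. This formulation leaves the tokenizer and any case normalization unspecified; TTR values depend on both.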
Cloud and Govt passages are substantially longer (mean 233 and 258 tokens, respectively) than ClapNQ (103 tokens) and FiQA (133 tokens) passages.
TTR is considerably lower in Cloud (0.43) and Govt (0.49) than in ClapNQ and FiQA ($\approx 0.70$).

These results indicate that Cloud contains longer and more lexically repetitive technical documentation, where semantic reranking provides stronger disambiguation. In contrast, ClapNQ and FiQA contain shorter, lexically diverse passages where hybrid retrieval already achieves strong alignment.

\section{Additional hyperparameters}
\subsection{Retrieval}
\label{sec:hypa_retrieval}
Retrieval hyperparameters for RRF were tuned using grid search on the development split to select optimal retriever weights and the smoothing parameter $k$, targeting nDCG@5. The smoothing parameter was searched over $k \in \{5, 10\}$.

Retriever-specific weights were searched over the following ranges:
\begin{itemize}
    \item SPLADE-v3 ($w_S$): $[0.0, 3.0]$ in increments of 0.25
    \item Qwen3-Embedding-8B ($w_Q$): $[0.0, 2.0]$ in increments of 0.25
    \item NV-Embed-v2 ($w_N$): $[0.0, 2.0]$ in increments of 0.25
    \item text-embedding-3-large ($w_O$): $[0.0, 3.0]$ in increments of 0.25
\end{itemize}

Weights for SPLADE-v3 and text-embedding-3-large were searched over a wider range because preliminary experiments indicated stronger standalone retrieval performance for these two retrievers than for Qwen3-Embedding-8B and NV-Embed-v2.

Grid search was conducted separately for (i) fused ranked lists prior to reranking and (ii) reranked outputs. Due to computational constraints, we did not jointly optimize fusion weights across both stages.
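
For a sense of scale, and assuming that each search covers the full Cartesian product of the ranges above with inclusive endpoints (an assumption of this estimate; no pruning is reflected), there are 13 candidate values each for $w_S$ and $w_O$, 9 each for $w_Q$ and $w_N$, and 2 for $k$:
\[
13 \times 9 \times 9 \times 13 \times 2 = 27{,}378
\]
configurations per search. Such a grid is tractable if the per-retriever ranked lists are cached, since each configuration then requires only re-fusion and re-scoring rather than new retrieval.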

\subsection{Reranking}
\label{sec:hypa_reranker}
For reranker training (Qwen3-Reranker-8B), we constructed 1,665 query-document supervision instances from rewritten queries, annotated positive contexts, and SPLADE-v3 sparse retrieval results. Positive chunks were taken from ground-truth contexts, while hard negatives were mined after excluding positive chunk IDs using rank-stratified sampling:
\begin{itemize}
    \item 2 from top 10
    \item 3 from ranks 11--50
    \item 3 from ranks 51--100
\end{itemize}
During training, we capped each instance to at most 1 positive and 6 negatives, yielding multiple positive--negative pairs per query and effectively increasing supervision. We formulate reranking as a pointwise binary classification problem, where each query--chunk pair is independently labeled as relevant or non-relevant, and optimize a binary cross-entropy loss. We held out 16\% of the data for validation and employed LoRA for fine-tuning, with hyperparameters summarized in Table~\ref{tab:reranker_hparams}.
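
Written out, the pointwise objective over $N$ labeled query--chunk pairs $(q_j, c_j, y_j)$ with $y_j \in \{0, 1\}$ takes the standard binary cross-entropy form
\[
\mathcal{L} = -\frac{1}{N} \sum_{j=1}^{N} \Big[ y_j \log p_j + (1 - y_j) \log (1 - p_j) \Big],
\]
where $p_j \in (0, 1)$ is the reranker's predicted relevance probability for pair $j$. How $p_j$ is derived from the model output (for example, from a sigmoid over a relevance logit) is left abstract in this sketch. Under the cap above, and if each of the 1,665 instances corresponds to one query, each instance contributes at most $1 + 6 = 7$ labeled pairs, giving an upper bound of $1{,}665 \times 7 = 11{,}655$ pairs before the validation hold-out.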

\begin{table}[h]
\centering
\small
\begin{tabular}{l c}
\hline
\textbf{Parameter} & \textbf{Value} \\
\hline
Learning rate & $6\times10^{-6}$ \\
Epochs & 3 \\
LoRA rank & 8 \\
LoRA $\alpha$ & 32 \\
Gradient accumulation & 16 \\
Mixed precision & BF16 \\
\hline
\end{tabular}
\caption{LoRA fine-tuning hyperparameters for Qwen3-Reranker-8B.}
\label{tab:reranker_hparams}
\end{table}

\subsection{Evaluation Metrics}
\label{sec:eval_metrics}
Subtask A (Retrieval) was evaluated using normalized Discounted Cumulative Gain at rank 5 (nDCG@5).
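
For reference, a common formulation of nDCG@$k$ is given below; the organizers' evaluation script may use a different gain function, although the standard variants coincide for binary relevance labels.
\[
\text{DCG@}k = \sum_{i=1}^{k} \frac{\mathrm{rel}_i}{\log_2(i + 1)}, \qquad
\text{nDCG@}k = \frac{\text{DCG@}k}{\text{IDCG@}k},
\]
where $\mathrm{rel}_i$ is the relevance label of the passage at rank $i$ and $\text{IDCG@}k$ is the $\text{DCG@}k$ of an ideal ranking. As a hypothetical example with binary labels, if a query has two relevant passages and a system returns them at ranks 1 and 3, then $\text{DCG@}5 = 1 + 1/\log_2 4 = 1.5$, $\text{IDCG@}5 = 1 + 1/\log_2 3 \approx 1.631$, and $\text{nDCG@}5 \approx 0.920$.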

Subtasks B and C were evaluated using the harmonic mean of three metrics defined by the organizers \citep{MTRAGEvalProposal2026}:

\begin{itemize}
    \item $RB_{alg}$: harmonic mean of BERTScore Recall, ROUGE-L, and BERT-K Precision,
    \item $RB_{llm}$: reference-based LLM judge score,
    \item $RL_F$: faithfulness score measuring grounding with respect to retrieved passages.
\end{itemize}
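
Written out, with $\mathrm{Score}$ as an illustrative name for the aggregate, the harmonic mean of the three metrics is
\[
\mathrm{Score} = \frac{3}{\dfrac{1}{RB_{alg}} + \dfrac{1}{RB_{llm}} + \dfrac{1}{RL_F}}.
\]
For illustration only, with hypothetical component scores $RB_{alg} = 0.50$, $RB_{llm} = 0.60$, and $RL_F = 0.75$, the aggregate is $3/(2.000 + 1.667 + 1.333) = 0.60$, which is below the arithmetic mean of about $0.617$: the harmonic mean penalizes a weak component more strongly.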

Evaluation is conditioned on answerability classification via an ``I Don't Know'' (IDK) judge as described in the task overview papers.

\subsection{Implementation Details}

Sparse retrieval was implemented using SPLADE-v3.
The sparse retrieval, dense retrieval, and reranking models were obtained from Hugging Face Transformers.\footnote{\url{https://huggingface.co/transformers}}
LoRA fine-tuning was implemented with the Hugging Face PEFT library.\footnote{\url{https://github.com/huggingface/peft}}
Generation was performed using GPT-5 via the Azure OpenAI API.

\section{Ablation study details}
\subsection{Retrieval}
\label{sec:retrieval_refinements}
To quantify the contribution of individual retrieval refinements, we conducted controlled experiments on the development split, progressively modifying the retrieval pipeline.

We compare the following configurations:
\begin{enumerate}
    \item Raw corpus indexing
    \item Cleaned corpus (Unicode normalization and boilerplate removal)
    \item Cleaned corpus with summary augmentation
    \item Query prefixing for embedding models
    \item Domain-adaptive reranking
\end{enumerate}

\paragraph{Query Prefixing}
For embedding-based retrieval, we applied model-specific query prefixes to better align query representations with each model's training objective. For NV-Embed-v2, we used:

\begin{quote}
\small
\texttt{Instruct: Given a search query, retrieve relevant passages that answer the query.\\
Query: <query>}
\end{quote}

For other embedding models, we applied:

\begin{quote}
\small
\texttt{Represent this sentence for searching relevant passages: <query>}
\end{quote}

Query prefixing yielded consistent improvements in dense retrieval performance, suggesting that the prefixes bring query encodings closer to the instruction formats the models were trained on.
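
The prefixing scheme can be sketched in Python as follows. The helper name \texttt{prefix\_query} and the model-name check are hypothetical, and we assume that the line break in the NV-Embed-v2 template corresponds to a newline character:
\begin{lstlisting}[language=Python]
# Prefix templates copied from the two quoted prompts above.
NV_PREFIX = (
    "Instruct: Given a search query, retrieve relevant "
    "passages that answer the query.\nQuery: "
)
DEFAULT_PREFIX = (
    "Represent this sentence for searching relevant passages: "
)

def prefix_query(query: str, model_name: str) -> str:
    """Return the query with its model-specific prefix."""
    # Assumption: NV-Embed-v2 is identified by its model name.
    is_nv = model_name.lower().startswith("nv-embed")
    return (NV_PREFIX if is_nv else DEFAULT_PREFIX) + query

print(prefix_query("What is an index fund?", "NV-Embed-v2"))
print(prefix_query("What is an index fund?", "Qwen3-Embedding-8B"))
\end{lstlisting}
Only queries are prefixed in this sketch; passages are assumed to be encoded without a prefix.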

\subsection{Model performance comparison}

\begin{table}[t]
\centering
\footnotesize
\setlength{\tabcolsep}{4pt}
\begin{tabular}{lcccc}
\hline
\textbf{Domain} & \textbf{NV-emb} & \textbf{Qwen3-emb} & \textbf{text-emb} & \textbf{SPLADE} \\
\hline
FiQA   & 0.389 & 0.395 & 0.407 & 0.404 \\
Govt   & 0.453 & 0.437 & 0.453 & 0.465 \\
ClapNQ & 0.536 & 0.533 & 0.535 & 0.537 \\
Cloud  & 0.422 & 0.407 & 0.421 & 0.425 \\
\hline
\end{tabular}
\caption{Development nDCG@5 of individual retrievers across domains prior to fusion.}
\label{tab:retrieval_model_comparison}
\end{table}

Table~\ref{tab:retrieval_model_comparison} reports development nDCG@5 scores of individual retrievers prior to fusion. Performance varies across domains, reflecting differences in corpus characteristics.
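
As a rough cross-domain summary, the unweighted mean of each column in Table~\ref{tab:retrieval_model_comparison} can be computed directly from the reported values. This is not an official aggregate, and it ignores any difference in the number of queries per domain:
\begin{align*}
\bar{s}_{\text{NV-emb}} &= 1.800 / 4 = 0.450, \\
\bar{s}_{\text{Qwen3-emb}} &= 1.772 / 4 = 0.443, \\
\bar{s}_{\text{text-emb}} &= 1.816 / 4 = 0.454, \\
\bar{s}_{\text{SPLADE}} &= 1.831 / 4 \approx 0.458.
\end{align*}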

SPLADE-v3 achieves the highest nDCG@5 in Govt (0.465), ClapNQ (0.537), and Cloud (0.425). Its margin over the best dense model is largest in Govt (0.012), consistent with sparse lexical matching being effective on formal or entity-heavy content, and is marginal in ClapNQ (0.001) and Cloud (0.003). In FiQA, text-embedding-3-large slightly outperforms SPLADE-v3 (0.407 vs.\ 0.404), likely due to the conversational and semantically varied nature of financial discussion data.

Overall, the differences among retrievers are small in most domains, and no single retrieval paradigm dominates. This further motivates the use of weighted RRF to combine complementary signals.

\section{Prompt Design}
\subsection{Generation prompt}
\label{sec:gen_prompt}

Due to space constraints, we summarize the key components of the system prompt used for generation in Subtasks B and C rather than reproducing it in full.

The prompt enforces strict grounding and answerability behavior in a RAG setting through the following mechanisms:

\paragraph{Knowledge Restriction.}
The model is instructed to treat retrieved passages as its entire knowledge base for factual, explanatory, and procedural queries. It is explicitly prohibited from introducing external knowledge or inferring unstated facts.

\paragraph{Answerability Classification.}
For informational queries, the model must classify the request as fully answerable, partially answerable, or unanswerable based solely on retrieved evidence.  
Unanswerable cases require a natural refusal (e.g., ``I do not have enough information'').  
Partially answerable cases must clearly separate supported and unsupported content.

\paragraph{Message-Type Conditioning.}
The prompt includes behavioral rules for different query types (factoid, summarization, explanation, comparative, troubleshooting, conversational). Informational responses must be grounded, while purely social turns may use generic conversational language without introducing new facts.

\paragraph{Multi-Turn Constraints.}
The model is instructed to answer only the latest user turn, using prior dialogue history solely for contextual grounding (e.g., pronoun resolution).

\paragraph{Metric Alignment.}
The prompt design explicitly targets faithfulness and answerability, aligning generation behavior with the official evaluation metrics used in MTRAGEval.

\subsection{LLM-as-a-Judge prompt}
\label{sec:eval_prompt}
To enforce contextually appropriate answers, we computed the LLM-as-a-Judge score with the following prompt.
\begin{lstlisting}[basicstyle=\ttfamily\scriptsize]
You should act as a judge and evaluate the given answer against the user question and provided contexts. 

Scoring: Rate the answer strictly on a continuous scale from 0.0 to 1.0 where: 
1.0 -> fully faithful to the document, appropriate to the question, and complete 
0.0 -> unfaithful, irrelevant, hallucinated, or unsupported 
0.25 -> weak, partially hallucinated or incomplete 
0.5 -> moderately correct but missing details or partially faithful 
0.75 -> mostly correct, minor omissions or minor irrelevancies 

You MUST output ONLY the final rating number. No explanation, no sentences, no extra text. Just the number.
\end{lstlisting}
\end{document}