\documentclass[11pt]{article}

% Change "review" to "final" to generate the final (sometimes called camera-ready) version.
% Change to "preprint" to generate a non-anonymous version with page numbers.
\usepackage[final]{acl}

% Standard package includes
\usepackage{times}
\usepackage{latexsym}
\usepackage{amsmath}
\usepackage{tabularx}
\usepackage{booktabs}
\usepackage{float}
\usepackage{comment}
\usepackage{multirow}
% For proper rendering and hyphenation of words containing Latin characters (including in bib files)
\usepackage[T1]{fontenc}
% For Vietnamese characters
% \usepackage[T5]{fontenc}
% See https://www.latex-project.org/help/documentation/encguide.pdf for other character sets

% This assumes your files are encoded as UTF8
\usepackage[utf8]{inputenc}

% This is not strictly necessary, and may be commented out,
% but it will improve the layout of the manuscript,
% and will typically save some space.
\usepackage{microtype}

% This is also not strictly necessary, and may be commented out.
% However, it will improve the aesthetics of text in
% the typewriter font.
\usepackage{inconsolata}

%Including images in your LaTeX document requires adding
%additional package(s)
\usepackage{graphicx}

% If the title and author information does not fit in the area allocated, uncomment the following
%
%\setlength\titlebox{<dim>}
%
% and set <dim> to something 5cm or larger.

\title{
DualAxis AI at SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis}

% Author information can be set in various styles:
% For several authors from the same institution:
%\author{Author 1 \and ... \and Author n \\
%         Address line \\ ... \\ Address line}
% if the names do not fit well on one line use
%         Author 1 \\ {\bf Author 2} \\ ... \\ {\bf Author n} \\
% For authors from different institutions:
% \author{Author 1 \\ Address line \\  ... \\ Address line
%         \And  ... \And
%         Author n \\ Address line \\ ... \\ Address line}
% To start a separate ``row'' of authors use \AND, as in
% \author{Author 1 \\ Address line \\  ... \\ Address line
%         \AND
%         Author 2 \\ Address line \\ ... \\ Address line \And
%         Author 3 \\ Address line \\ ... \\ Address line}


\author{
  \textbf{Yahya Missaoui}$^1$, 
  \textbf{Solomon Kebede}$^1$, 
  \textbf{Mounika Marreddy}$^1$, 
  \textbf{Alexander Mehler}$^1$ \\
  $^1$Goethe University, Frankfurt am Main, Germany \\
  \texttt{\small missaoui@stud.uni-frankfurt.de, solo.kebede@stud.uni-frankfurt.de} \\
  \texttt{\small mmarredd@em.uni-frankfurt.de, mehler@em.uni-frankfurt.de}
}

% for submitting: Mounika Marreddy, third name
%Alexander Mehler, fourth name
\begin{document}
\maketitle
\begin{abstract}

Dimensional Aspect-Based Sentiment Analysis models sentiment using continuous valence and arousal scores instead of discrete polarity labels, enabling fine-grained affect representation at the aspect level. SemEval 2026 Task 3 defines this setting through three subtasks covering aspect-level regression and structured extraction of aspect–opinion pairs with continuous scoring. We implement transformer-based baselines for all subtasks within a unified, reproducible framework. For aspect-level regression, we fine-tune pretrained encoders in an aspect-conditioned setup to predict valence and arousal. RoBERTa-large achieves the best development performance, with average RMSEs of 0.884 (restaurant) and 0.789 (laptop).

For the structured subtasks, we adopt an instruction-based sequence-to-sequence generation approach using Flan-T5. The model generates triplets or quadruplets in a canonical textual format, thereby jointly producing aspect terms, opinion terms, valence–arousal scores, and, for Subtask~3, aspect categories. Our best model attains continuous F1 scores of 0.742 and 0.648 for triplet extraction, and 0.604 and 0.385 for quadruplet extraction on the restaurant and laptop domains, respectively. Results show that continuous aspect-level regression is relatively stable under standard fine-tuning, whereas jointly extracting structured elements and predicting continuous affect remains considerably more challenging. Our systems provide reproducible baselines under the official evaluation protocol for future work on dimensional aspect-based sentiment analysis.

\end{abstract}


\section{Introduction}
\label{sec:intro}
Sentiment toward specific aspects of a product or service is often expressed with varying emotional intensity. Traditional aspect-based sentiment analysis typically reduces this variation to discrete labels such as positive or negative, overlooking differences in strength or activation. Modeling sentiment along continuous affective dimensions provides a more expressive representation of aspect-level opinions.

Shifting from classification to continuous prediction changes the learning problem. Models must estimate calibrated real-valued scores rather than select from fixed categories. The challenge increases in structured settings, where systems must first identify aspect and opinion spans and then assign appropriate affective values; extraction errors directly affect score quality.

SemEval 2026 Task 3 provides a unified benchmark for studying these challenges by evaluating aspect-level regression and structured extraction under the same dimensional framework. This setting enables systematic analysis of how modern pretrained encoders perform on regression and structured prediction when affective intensity is modeled explicitly.

Although transformer-based models have shown strong results in sentiment classification, their behavior under continuous supervision remains less explored, particularly when span identification and score prediction are combined. In this work, we present controlled and reproducible baseline systems for all subtasks. By maintaining consistent data processing and decoding procedures across encoder backbones, we isolate the impact of architectural choices on dimensional sentiment modeling. Our systems are intended to serve as reference baselines for future research.

Our main contributions are threefold. First, we evaluate encoder-based aspect-conditioned regression models for Subtask~1 across the restaurant and laptop domains. Second, we formulate Subtasks~2 and~3 as instruction-based Flan-T5 generation tasks that directly produce structured triplets and quadruplets. Third, we provide development-set results, training diagnostics, code, and an error-oriented analysis to support reproducibility and facilitate future comparisons. The code is publicly available at \url{https://github.com/SolomonM-Kebede/ProjectNLP-DimABSA2026}.


\section{Related Work}
\label{sec:related}
\paragraph{Dimensional sentiment modeling.}
Research in affective psychology has long argued for modeling emotions along continuous dimensions rather than discrete categories. The circumplex model of affect \cite{russell1980circumplex,russell2003coreaffect} formalizes emotion primarily in terms of valence and arousal, a representation that has since been adopted in NLP. Lexical resources such as the NRC VAD Lexicon \citep{mohammad2018nrcvad} and sentence-level datasets like EmoBank \cite{buechel2017emobank} operationalize this framework by providing real-valued affective annotations. These resources have driven regression-based sentiment models that capture fine-grained affective intensity, moving beyond coarse polarity classification.

\paragraph{Aspect-based sentiment analysis and structured prediction.}
Aspect-based sentiment analysis (ABSA) focuses on identifying sentiments expressed toward specific targets, evolving from pipeline-based approaches to joint structured prediction. Survey work \cite{zhang2022absaSurvey} documents this progression, highlighting formulations such as Aspect Sentiment Triplet Extraction (ASTE) \cite{peng2020aste} and Aspect Sentiment Quad Prediction (ASQP) \cite{zhang2021asqpParaphrase}. These tasks require jointly modeling multiple interdependent components—aspect terms, opinion terms, sentiment labels, and optionally aspect categories—substantially increasing modeling complexity compared to sentence-level sentiment classification.

\paragraph{Query-based extraction for structured ABSA.}
To address the challenges of joint prediction, many recent approaches cast structured ABSA as a machine reading comprehension (MRC) problem. In this paradigm, models extract aspect and opinion spans by answering task-specific natural language queries, followed by sentiment or category classification \citep{chen2021bmrc,gao2021questiondriven}. Multi-turn and bidirectional querying strategies help mitigate error propagation and improve coverage in structured settings. Our DimASTE and DimASQP baselines keep the task-specific query but replace span extraction with instruction-based generation, producing continuous valence and arousal scores alongside the structured sentiment elements.

\paragraph{Pretrained transformers and generative baselines.}
Pretrained transformer models serve as the backbone for most modern ABSA systems due to their strong contextual representations. Encoder-based architectures such as BERT \cite{devlin2019bert} and RoBERTa \citep{liu2019roberta} remain competitive under carefully controlled training and decoding conditions. In parallel, sequence-to-sequence models such as T5 \cite{raffel2020t5} and instruction-tuned variants like Flan \cite{chung2022flan} enable generative formulations that produce structured outputs as text. Comparing encoder-based and generative approaches under a unified evaluation protocol provides insights into their relative strengths for dimensional and structured sentiment modeling.

\paragraph{Dimensional aspect-based sentiment shared tasks.}
Recent shared tasks have formalized the integration of dimensional affect modeling with ABSA. SemEval 2026 Task 3 introduces Dimensional ABSA (DimABSA), requiring prediction of continuous valence and arousal scores either for given aspects (DimASR) or jointly with structured sentiment extraction (DimASTE/DimASQP) \citep{semeval2026dimabsa}. Related efforts, such as the SIGHAN 2024 shared task on Chinese Dimensional ABSA \citep{sighan2024dimabsa}, similarly emphasize fine-grained affect intensity estimation within aspect-aware sentiment frameworks, underscoring the growing interest in continuous sentiment representations.




\section{Task and Dataset Description}
\label{sec:task-data}
\subsection{Task definition}
SemEval 2026 Task 3 defines DimABSA as the problem of predicting continuous valence and arousal scores for aspect-level sentiment. The task is divided into three subtasks, each addressing a different prediction setting.

\textbf{Subtask 1 (DimASR).} Given a text and a predefined list of aspects, the system must predict a valence--arousal (VA) score for each aspect. This subtask focuses on aspect-level regression.

\textbf{Subtask 2 (DimASTE).} Given a text without predefined aspects, the system must extract all aspect--opinion--score triplets $(A, O, VA)$. Here, $A$ denotes the aspect term, $O$ denotes the opinion term, and $VA$ represents the corresponding valence--arousal score.

\textbf{Subtask 3 (DimASQP).} Given a text, the system must extract all quadruplets $(A, O, C, VA)$, where $C$ denotes the aspect category in addition to the aspect term, opinion term, and valence--arousal score.

Together, these subtasks evaluate both regression and structured extraction under a unified dimensional sentiment framework.
Abbreviations used throughout the paper are summarized in Table~\ref{tab:abbr} in Appendix~\ref{app:abbr}.





\subsection{Data format}
All datasets are released in JSONL format, where each line corresponds to a single instance containing a unique \texttt{ID} and a \texttt{Text} field. For DimASR, the input additionally includes an \texttt{Aspect} list. The system must output an \texttt{Aspect\_VA} field, which contains one predicted VA score for each given aspect. For DimASTE, the output consists of a \texttt{Triplet} list. Each triplet includes the fields \texttt{Aspect}, \texttt{Opinion}, and \texttt{VA}. For DimASQP, the output is a \texttt{Quadruplet} list that includes \texttt{Aspect}, \texttt{Opinion}, \texttt{Category}, and \texttt{VA}. In the released data, implicit or unmentioned spans are represented using the literal string \texttt{NULL}, for example \texttt{Aspect = NULL}.
The VA label is represented as a string in the format \texttt{V\#A}, where the first value corresponds to valence and the second to arousal. Each dimension takes a value between $1.00$ and $9.00$ and is rounded to two decimal places. For example, a valid label may appear as \texttt{6.75\#6.38}. For model selection during development, we use the provided training data and reserve 10\% as an internal validation split where required, resulting in a 90/10 training--validation split. Final results on the development set are reported using the official development files and evaluation scripts.

\subsection{Evaluation}
DimASR is evaluated using Root Mean Squared Error (RMSE), computed between the predicted and gold valence and arousal scores. This metric measures the average deviation of the predicted continuous values from the reference annotations. DimASTE and DimASQP are evaluated using the official continuous F1 (cF1) metric. This metric rewards correct extraction of structured sentiment elements while also incorporating the distance between predicted and gold VA scores. All experiments are conducted using the official evaluation scripts and the provided training and development splits. The test labels are not publicly available.
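
The two metrics can be sketched as follows. The notation ($N$, $\hat{v}_i$, $v_i$, $\mathcal{M}$, $P$, $G$, $D_{\max}$) is introduced here for illustration, and the cF1 formulation reflects our reading of the task description rather than a restatement of the official script. For $N$ evaluated aspects with predicted and gold valence $\hat{v}_i$ and $v_i$,
\begin{equation*}
\mathrm{RMSE}_v = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (\hat{v}_i - v_i)^2},
\end{equation*}
with $\mathrm{RMSE}_a$ defined analogously for arousal. The tabulated $\mathrm{RMSE}_{avg}$ values are consistent with the arithmetic mean of the two, e.g.\ $(1.010 + 0.757)/2 \approx 0.884$ for RoBERTa-large on the restaurant domain (Table~\ref{tab:dimasr_restaurant}).

For cF1, let $P$ and $G$ be the sets of predicted and gold tuples and $\mathcal{M} \subseteq P$ the predictions whose categorical elements match a gold tuple. Each match counts as a partial true positive, discounted by its normalized VA distance:
\begin{align*}
\mathrm{dist}_t &= \frac{\sqrt{(\hat{v}_t - v_t)^2 + (\hat{a}_t - a_t)^2}}{D_{\max}}, \\
\mathrm{cTP} &= \sum_{t \in \mathcal{M}} \left(1 - \mathrm{dist}_t\right), \\
\mathrm{cP} &= \frac{\mathrm{cTP}}{|P|}, \quad
\mathrm{cR} = \frac{\mathrm{cTP}}{|G|}, \\
\mathrm{cF1} &= \frac{2\,\mathrm{cP}\,\mathrm{cR}}{\mathrm{cP} + \mathrm{cR}}.
\end{align*}
Here $D_{\max} = \sqrt{8^2 + 8^2} = \sqrt{128}$ is the largest possible Euclidean distance between two points on the $[1, 9]$ scale; its use as the normalizer is an assumption of this sketch. When all matched scores are exact, cF1 reduces to the standard F1.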

\section{Method}
\label{sec:method}
In this section, we describe the transformer-based models developed for SemEval-2026 Task~3. 
Rather than relying on a single architecture for all subtasks, we employ a hybrid modeling approach tailored to the requirements of each subtask. 
For Subtask~1, we formulate the problem as aspect-conditioned valence--arousal (VA) regression, where the model predicts continuous VA scores for each given aspect. 
We use discriminative models based on pretrained BERT and RoBERTa encoders, which are well suited for capturing global sentence-level semantics and mapping them to continuous affective dimensions.

For Subtasks~2 and~3, corresponding to DimASTE and DimASQP, we adopt a generative query-based extraction framework based on FLAN-T5~\cite{chung2022flan}. 
Instead of using a discriminative pipeline with separate span-detection, regression, and classification heads, we treat the task as instruction-based generation. 
The encoder receives a task-specific query together with the review text, while the decoder acts as a unified multi-task prediction component. 
It generates a structured JSON sequence representing aspect spans, opinion spans, VA values, and, for Subtask~3, aspect categories. 
This design is inspired by the generative aspect-based sentiment analysis framework of Li et al. and allows the model to capture dependencies between aspects, opinions, categories, and sentiment dimensions within a single generation process.



\subsection{DimASR: aspect-conditioned VA regression}
\label{sec:method_dimasr}
For Subtask~1, we construct one training instance for each annotated aspect. 
From every JSONL entry in the training data, we extract the review text and iterate over its labeled units. 
For each aspect, we pair the aspect string with its corresponding gold VA value. 
Each pair is transformed into a single input sequence by concatenating an aspect prompt with the review text:

\begin{center}
\texttt{aspect: <a> text: <x>}
\end{center}

Here, $a$ denotes the aspect and $x$ denotes the review text. The resulting sequence is tokenized using the RoBERTa tokenizer. Our regression model is built on a pretrained RoBERTa encoder. We apply masked mean pooling over the final-layer token representations to obtain a fixed-length sentence representation. This representation is passed to a multi-layer feed-forward prediction head with GELU activations and LayerNorm, which outputs two continuous values, $\hat{v}$ and $\hat{a}$, corresponding to predicted valence and arousal.
To stabilize training, the target scores are standardized using the mean and standard deviation computed from the training split. During inference, predictions are transformed back to the original scale for evaluation and submission. The model is trained using Huber loss applied to the two-dimensional regression target. At inference time, we generate one input sequence for each text and aspect pair provided in the test data. For every aspect, the model outputs a predicted VA score pair in the required \texttt{Aspect\_VA} field.
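
The pooling, target scaling, and loss described above can be sketched as follows. The notation is introduced here for illustration: $h_i$ is the final-layer representation of token $i$, $m_i \in \{0, 1\}$ its attention-mask value, $T$ the sequence length, and $\mu$, $\sigma$ the training-split mean and standard deviation of one target dimension. The Huber threshold $\delta$ is not specified above; $\delta = 1$ is a common default and only an assumption here.
\begin{align*}
\bar{h} &= \frac{\sum_{i=1}^{T} m_i\, h_i}{\sum_{i=1}^{T} m_i}, \\
z &= \frac{y - \mu}{\sigma}, \qquad \hat{y} = \mu + \sigma\,\hat{z}, \\
\ell_\delta(r) &=
\begin{cases}
\tfrac{1}{2} r^2 & \text{if } |r| \le \delta, \\
\delta \left(|r| - \tfrac{1}{2}\delta\right) & \text{otherwise,}
\end{cases}
\end{align*}
where $r = \hat{z} - z$ is the residual on the standardized scale, computed separately for valence and arousal. The quadratic region behaves like squared error for small residuals, while the linear region limits the influence of outlying annotations.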


\subsection{DimASTE/DimASQP: sequence-to-sequence structured generation}
\label{sec:method_dimaste}

For Subtasks~2--3, we model structured extraction with continuous scoring as a text-to-text generation problem
using an instruction-tuned encoder--decoder backbone (Flan-T5).
Given a review text, the model is prompted to generate all sentiment structures in a canonical textual format.
For DimASTE, the target consists of a list of triplets $(A, O, VA)$; for DimASQP, the target is a list of
quadruplets $(A, O, C, VA)$, where $C$ is the aspect category and $VA$ is emitted as \texttt{V\#A}.
We use a single model per subtask and domain and train with standard sequence-to-sequence cross-entropy loss
(negative log-likelihood) over the target tokens.
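
As a sketch of this objective, with notation introduced here for illustration ($x$ for the prompted input, $y = (y_1, \dots, y_{|y|})$ for the tokens of the linearized target, and $\theta$ for the model parameters), the loss for one training example is
\begin{equation*}
\mathcal{L}(\theta) = -\sum_{t=1}^{|y|} \log p_\theta\!\left(y_t \mid y_{<t}, x\right).
\end{equation*}
Any normalization over tokens or batch elements is omitted from this sketch.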

During preprocessing, gold annotations from the official JSONL files are converted into the corresponding target
string representation. At inference time, we decode with beam search and parse the generated text back into
structured records. We apply lightweight normalization (e.g., trimming whitespace and normalizing separators)
and deduplicate identical structures before writing predictions in the official \texttt{Triplet} or
\texttt{Quadruplet} JSONL output schema. Predicted VA values are taken directly from the generated \texttt{V\#A}
strings and are clipped to the valid range when necessary.
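
The clipping step can be written as follows, assuming it is applied independently to each dimension with the label range $[1.00, 9.00]$ stated in the data format; the symbol $s$ for a generated score is notation introduced here:
\begin{equation*}
\mathrm{clip}(s) = \min\bigl(9.00,\ \max(1.00,\ s)\bigr).
\end{equation*}
For example, a generated string \texttt{9.40\#0.75} would be mapped to \texttt{9.00\#1.00}, whereas \texttt{6.75\#6.38} is left unchanged.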


\subsection{Experimental Setup}
\label{sec:exp}
All experiments are conducted using the official train/dev splits and the evaluation scripts provided by the task organizers. For Subtask~1, the training data is expanded into individual $(\text{text}, \text{aspect})$ instances, where each aspect in a review forms a separate example. We use a 90/10 split of the training data for internal model selection where required. We fine-tune RoBERTa and BERT encoders using AdamW with learning rate scheduling, gradient clipping, and gradient accumulation, following the configuration specified in the task script, and select the best checkpoint based on validation performance. The hyperparameters for Subtask~1 are provided in Table~\ref{tab:consolidated_hyperparams}. For Subtasks~2 and~3, we fine-tune Flan-T5 models for structured generation and select model checkpoints based on development-set performance. The corresponding hyperparameters are also provided in Table~\ref{tab:consolidated_hyperparams}. All experiments were conducted on Google Colab and Kaggle using NVIDIA Tesla T4 GPUs with approximately 16\,GB of VRAM.




\section{Results}
\label{sec:results}


We evaluate on the official test set for Subtasks~2--3, and on the
development set for Subtask~1.\footnote{Test-set labels for Subtask~1
(DimASR) were not distributed to participants. We therefore report
development-set performance and compare against the organizer baseline
on the same split. All Subtask~2 and~3 numbers are on the official
test set.}
DimASR is evaluated with RMSE and CCC; DimASTE and DimASQP use the
official continuous-F1 (cF1) metric.
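
CCC here refers to the concordance correlation coefficient. As a reminder of its standard definition, with $\mu$, $\sigma^2$, and $\rho$ denoting the mean, variance, and Pearson correlation of the predicted ($\hat{y}$) and gold ($y$) scores (notation introduced here),
\begin{equation*}
\mathrm{CCC} = \frac{2\,\rho\,\sigma_{\hat{y}}\,\sigma_{y}}{\sigma_{\hat{y}}^{2} + \sigma_{y}^{2} + (\mu_{\hat{y}} - \mu_{y})^{2}}.
\end{equation*}
Unlike RMSE, higher values are better, and a value of 1 indicates perfect agreement. $\mathrm{CCC}_{avg}$ is read here as the mean of the valence and arousal coefficients, by analogy with $\mathrm{RMSE}_{avg}$; this reading is an assumption of the sketch.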


\subsection{DimASR: Aspect-Level VA Regression}
\label{sec:results_dimasr}

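Tables~\ref{tab:dimasr_restaurant} and~\ref{tab:dimasr_laptop} show
development results across backbone models.
RoBERTa-large achieves the best $\mathrm{RMSE}_{avg}$ in both domains
(0.884 restaurant; 0.789 laptop), which suggests that larger encoder
capacity benefits continuous VA regression.
The ordering among the base-sized encoders is less consistent:
RoBERTa-base ranks last on restaurant (0.975) but second on laptop
(0.816), and BERT-base-cased attains the highest $\mathrm{CCC}_{avg}$ on
laptop (0.785) despite the highest $\mathrm{RMSE}_{avg}$ there.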

\begin{table}[H]
\centering\footnotesize\setlength{\tabcolsep}{3.5pt}
\renewcommand{\arraystretch}{1.0}
\resizebox{\columnwidth}{!}{%
\begin{tabular}{l r r r r r}
\toprule
\textbf{Backbone} & $\mathbf{RMSE}_v$ & $\mathbf{RMSE}_a$
  & $\mathbf{RMSE}_{avg}$ & $\mathbf{CCC}_{avg}$ & \textbf{Ep.}\\
\midrule
RoBERTa-large     & 1.010 & 0.757 & \textbf{0.884} & \textbf{0.800} & 5 \\
BERT-base-uncased & 1.071 & 0.787 & 0.929 & 0.774 & 3 \\
BERT-base-cased   & 1.059 & 0.825 & 0.942 & 0.745 & 2 \\
RoBERTa-base      & 1.086 & 0.865 & 0.975 & 0.701 & 3 \\
\bottomrule
\end{tabular}}
\caption{DimASR dev results, restaurant domain.}
\label{tab:dimasr_restaurant}
\end{table}

\begin{table}[H]
\centering\footnotesize\setlength{\tabcolsep}{3.5pt}
\renewcommand{\arraystretch}{1.0}
\resizebox{\columnwidth}{!}{%
\begin{tabular}{l r r r r r}
\toprule
\textbf{Backbone} & $\mathbf{RMSE}_v$ & $\mathbf{RMSE}_a$
  & $\mathbf{RMSE}_{avg}$ & $\mathbf{CCC}_{avg}$ & \textbf{Ep.}\\
\midrule
RoBERTa-large     & 0.829 & 0.748 & \textbf{0.789} & 0.778          & 2 \\
RoBERTa-base      & 0.847 & 0.786 & 0.816          & 0.775          & 3 \\
BERT-base-uncased & 0.877 & 0.760 & 0.818          & 0.764          & 2 \\
BERT-base-cased   & 0.907 & 0.746 & 0.826          & \textbf{0.785} & 2 \\
\bottomrule
\end{tabular}}
\caption{DimASR dev results, laptop domain.}
\label{tab:dimasr_laptop}
\end{table}


\subsection{DimASTE and DimASQP}
\label{sec:results_dimaste_dimasqp}

Table~\ref{tab:res_st23} reports test-set cF1 for both subtasks
alongside the organizer baseline.
Our system outperforms the baseline on all four subtask--domain
combinations.
For DimASTE, gains are +0.0582 (restaurant) and +0.0374 (laptop).
For DimASQP, we improve by +0.0378 on restaurant and by a large margin
of +0.3427 on laptop (0.5910 vs.\ 0.2483), suggesting that our
sequence-to-sequence approach handles the sparser, more technical
laptop vocabulary more robustly than the organizer baseline.
Notably, laptop DimASQP (0.5910) exceeds laptop DimASTE (0.5038),
which we attribute to the structured category vocabulary providing an
additional grounding signal absent in free-form triplet extraction.

\begin{table}[!t]
\centering\footnotesize\setlength{\tabcolsep}{3.5pt}
\begin{tabularx}{\columnwidth}{@{}l X r r r@{}}
\toprule
\textbf{Subtask} & \textbf{Domain / System}
  & \textbf{cF1} & \textbf{P} & \textbf{R} \\
\midrule
\multirow{4}{*}{DimASTE}
  & Rest.\ baseline  & 0.5442 & — & — \\
  & Rest.\ ours      & \textbf{0.6024} & 0.6319 & 0.5926 \\
  & Laptop baseline  & 0.4664 & — & — \\
  & Laptop ours      & \textbf{0.5038} & 0.5382 & 0.4942 \\
\midrule
\multirow{4}{*}{DimASQP}
  & Rest.\ baseline  & 0.5048 & — & — \\
  & Rest.\ ours      & \textbf{0.5426} & 0.5781 & 0.5292 \\
  & Laptop baseline  & 0.2483 & — & — \\
  & Laptop ours      & \textbf{0.5910} & 0.6011 & 0.5731 \\
\bottomrule
\end{tabularx}
\caption{DimASTE and DimASQP test-set results.
         Baseline = organizer system; P = Precision; R = Recall.}
\label{tab:res_st23}
\end{table}

\begin{figure}[!t]
\centering
\begin{minipage}{0.49\columnwidth}
  \centering
  \includegraphics[width=\linewidth]{figures/t5_large_subtask2_laptop3.png}
\end{minipage}\hfill
\begin{minipage}{0.49\columnwidth}
  \centering
  \includegraphics[width=\linewidth]{figures/t5_large_subtask2_laptop1.png}
\end{minipage}
\vspace{0.2em}
\begin{minipage}{\columnwidth}
  \centering
  \includegraphics[width=0.75\columnwidth]{figures/t5_large_subtask2_laptop2.png}
\end{minipage}
\caption{DimASTE diagnostics (\texttt{flan-t5-large}, laptop):
         training loss (top left), dev cF1 by epoch (top right),
         train vs.\ dev loss (bottom).}
\label{fig:dimaste_curves}
\end{figure}

\begin{figure}[!t]
\centering
\begin{minipage}{0.49\columnwidth}
  \centering
  \includegraphics[width=\linewidth]{figures/subtask3_t5_diagram1.png}
\end{minipage}\hfill
\begin{minipage}{0.49\columnwidth}
  \centering
  \includegraphics[width=\linewidth]{figures/subtask3_t5_diagram2.png}
\end{minipage}
\vspace{0.2em}
\begin{minipage}{\columnwidth}
  \centering
  \includegraphics[width=0.75\columnwidth]{figures/subtask3_t5_diagram3.png}
\end{minipage}
\caption{DimASQP diagnostics (\texttt{flan-t5-large}, restaurant):
         step-wise loss (top left), train vs.\ dev loss (top right),
         dev cF1 by epoch (bottom).}
\label{fig:dimasqp_curves}
\end{figure}


\subsection{Analysis}
\label{sec:results_analysis}

The results are consistent with structured extraction being harder than aspect-level regression, although RMSE and cF1 are not directly comparable. For DimASR, the largest encoder (RoBERTa-large) performs best in both domains, while the ranking of the base-sized encoders varies by domain. For DimASTE and DimASQP, all four baseline comparisons favour our system.
Training curves (Figures~\ref{fig:dimaste_curves}--\ref{fig:dimasqp_curves})
show rapid convergence within two to three epochs in the two settings shown.


\section{Discussion}
\label{sec:discussion}

\textbf{Seq2seq generalization.} Flan-T5-large's instruction-following pre-training may support generalization to the structured output format, particularly for the sparser, more technical laptop vocabulary. The large DimASQP laptop gain (+0.3427) suggests that the organizer baseline struggles with domain shift that our generation-based approach handles more gracefully.

\textbf{Category as grounding signal.} The reversal of the DimASTE--DimASQP ordering on the laptop domain
(0.5038 vs.\ 0.5910) suggests that the closed-set category
vocabulary (e.g., \texttt{BATTERY\#OPERATION\_PERFORMANCE}) may act as a
supervisory anchor, reducing the model's reliance on ambiguous
free-form opinion spans.

\textbf{Domain gap.} Restaurant reviews yield higher cF1 than laptop reviews for
DimASTE (0.6024 vs.\ 0.5038), consistent with the richer and more
stereotyped sentiment vocabulary of restaurant text.
The gap reverses for DimASQP (0.5426 vs.\ 0.5910), where the structured
categories may partially compensate for the sparse expressions of laptop
sentiment.

\textbf{Limitations and future work.} The three subtasks are currently solved independently; a joint model
sharing representations across DimASTE and DimASQP could reduce the
category-prediction bottleneck.
For DimASR, aspect-aware pooling beyond the masked mean pooling used here
may further reduce RMSE.
Domain-adaptive pre-training remains a promising direction to close
the restaurant--laptop gap.




\nocite{*}
\bibliography{custom}

%\clearpage % This forces all previous Results/Discussion tables to finish
\appendix

\section{Appendix}
\label{sec:appendix}

\subsection{Abbreviations}
\label{app:abbr}

% Using [!ht] usually keeps it close to the text, 
% but inside the Appendix because of \clearpage above.
\begin{table}[!ht]
\centering
\small
\setlength{\tabcolsep}{4pt}
\begin{tabularx}{\columnwidth}{@{}lX@{}}
\toprule
\textbf{Abbrev.} & \textbf{Meaning} \\
\midrule
ABSA    & Aspect-Based Sentiment Analysis \\
DimABSA & Dimensional Aspect-Based Sentiment Analysis \\
VA      & Valence--Arousal score pair \\
Subtask 1 & DimASR: aspect-level VA regression \\
Subtask 2 & DimASTE: $(A,O,VA)$ triplet extraction \\
Subtask 3 & DimASQP: $(A,O,C,VA)$ quadruplet extraction \\
JSONL   & JSON Lines (one JSON object per line) \\
RMSE    & Root Mean Squared Error \\
CCC     & Concordance Correlation Coefficient \\
cF1     & Continuous-F1 (official extraction metric with VA distance) \\
\bottomrule
\end{tabularx}
\caption{Abbreviations used in this paper.}
\label{tab:abbr}
\end{table}

\subsection{Hyperparameters}
\label{app:hyperparams}

For Subtask~1 we utilized pretrained variants of RoBERTa and BERT, while for Subtasks~2 and~3 we used FLAN-T5 base and FLAN-T5 large. Table~\ref{tab:consolidated_hyperparams} details the specific configurations used for each.

\begin{table}[!ht]
\centering
\scriptsize
\setlength{\tabcolsep}{2pt}
\begin{tabularx}{\columnwidth}{@{}l XX@{}}
\toprule
\textbf{Parameter} & \textbf{Subtask 1} & \textbf{Subtasks 2 \& 3} \\
\midrule
Backbone Models & RoBERTa/BERT & FLAN-T5 (B/L) \\
Max Sequence Length & 256 tokens & 256 tokens \\
Batch Size & 3 & 1 \\
Training Epochs & 4 & 8 \\
Learning Rate & $2 \times 10^{-5}$ & $4 \times 10^{-5}$ \\
Weight Decay & 0.01 & --- \\
Dropout & 0.3 & --- \\
Warm-up & --- & 0.1 \\
Grad. Accumulation & 16 & 16 \\
Max Gradient Norm & 1.0 & 1.0 \\
Early Stopping & 3 epochs & 2 epochs \\
Mixed Precision & Enabled & --- \\
Grad. Checkpointing & --- & Enabled \\
Target Normalization & Enabled & --- \\
\bottomrule
\end{tabularx}
\caption{Consolidated hyperparameters for all subtasks.}
\label{tab:consolidated_hyperparams}
\end{table}
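
With gradient accumulation, the effective batch size per optimizer step is the product of the per-device batch size $B$ and the number of accumulation steps $G$ (symbols introduced here). Assuming a single GPU per run,
\begin{equation*}
B_{\mathrm{eff}} = B \times G = 3 \times 16 = 48
\end{equation*}
for Subtask~1 and $1 \times 16 = 16$ for Subtasks~2 and~3.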


\end{document}
        