% !TEX program = pdflatex
\documentclass[11pt]{article}
\usepackage{acl}
\usepackage{latexsym}
\usepackage{url}
\usepackage{amsmath}
\usepackage{newtxtext,newtxmath}
\usepackage{graphicx}
\usepackage{tikz-cd}
\usepackage{booktabs}
\usepackage{microtype}
\usepackage{algorithm}
\usepackage{algpseudocode}

\begin{document}

\title{\textbf{dutir\_shlee at SemEval-2026 Task 11: Symbolic Augmentation for Content-Bias-Resistant Syllogistic Reasoning}}
\author{
Songhuan Li, Liang Yang$^{*}$, Shengdi Yin, Qiang Zhang, Hongfei Lin\\
School of Computer Science and Technology\\
Dalian University of Technology, China\\
\texttt{1341460619@mail.dlut.edu.cn, (liang,zhangq,hflin)@dlut.edu.cn}
}
\date{}
\maketitle

\begin{abstract}
We describe our system for SemEval-2026 Task 11 Subtask 1 (English syllogistic validity). Our approach fine-tunes Qwen2.5-7B-Instruct with LoRA and a symbolic data augmentation (SDA) scheme that replaces real-world entities with abstract placeholders, explicitly decoupling logical form from content. The resulting model achieves 96.34\% accuracy and a total content effect (TCE) of 2.15, yielding a primary score of 44.86. We provide detailed ablations and negative results (prompting, self-consistency, contrastive decoding, structured chain-of-thought, and DPO) to characterize why direct LoRA training with SDA is the most robust configuration for this task. Finally, we use a specialist--generalist complementarity setting where a strong API model (ACC 99.48, TCE 1.06, score 57.68) is corrected by the SDA specialist on a single disagreement, producing a merged output with ACC 100 and TCE 0.
\end{abstract}

\section{Introduction}
SemEval-2026 Task 11 targets content effects in logical reasoning by requiring models to judge syllogistic validity independently of plausibility. We participate only in Subtask 1, which is English-only binary validity classification. This task is important because belief bias is a persistent failure mode for LLMs in high-stakes reasoning settings, and the evaluation explicitly penalizes content-driven errors. The official task overview is given in \citep{valentino-etal-2026-semeval}.

Our main strategy is symbolic data augmentation (SDA) combined with parameter-efficient LoRA fine-tuning. By replacing concrete entities with abstract placeholders, SDA decouples logical form from surface content and forces the model to attend to quantifiers and negation structure. We then fine-tune a Qwen2.5-7B-Instruct base model using LoRA and run deterministic inference with a simple label-only output.

Through participation, we found that the LoRA+SDA system achieves 96.34\% accuracy with TCE 2.15 (score 44.86), while chain-of-thought prompting, self-consistency, and DPO did not improve bias robustness. Remaining errors concentrate on quantifier-scope and existential/universal edge cases. We did not track a shared-task leaderboard rank for this report.

\section{Background and Task Setup}
Each instance contains an English syllogism and a binary validity label. Inputs are short, controlled arguments in syllogistic form; outputs are \emph{Valid} or \emph{Invalid}. For example: ``All A are B. No B are C. Therefore, no A are C.'' $\rightarrow$ \emph{Valid}. We participate only in Subtask 1 (English binary classification). The official training set contains 960 items and the test split contains 191 items; we report results on the official splits without re-partitioning.

Performance is measured by overall accuracy (ACC) and total content effect (TCE), with the primary metric:
\[
\frac{\mathrm{ACC}}{1 + \ln(1+\mathrm{TCE})}.
\]
Lower TCE indicates stronger robustness to content bias. Prior work shows that LLMs systematically exhibit content effects on syllogisms \citep{dasgupta-etal-2022-content,bertolazzi-etal-2024-soft,valentino2025mitigating,kim-etal-2025-reasoning}. Related work evaluates deductive competence and syllogistic reasoning across settings and datasets \citep{seals-shalin-2024-deductive,ozeki-etal-2024-neubaroco,eisape-etal-2024-human-vs-llm,wysocka-etal-2025-syllobio}, and explores faithfulness and quasi-symbolic reasoning for chain-of-thought and explanations \citep{lyu-etal-2023-faithful-cot,xu-etal-2024-faithful-logical-cot,quan-etal-2024-verification,ranaldi-etal-2025-quasi}. Our system focuses on eliminating content cues through SDA, encouraging a representation aligned with abstract logical form rather than world knowledge.

The dataset is intentionally constructed to disentangle validity from plausibility, containing both believable yet invalid arguments and implausible yet valid ones; the metric explicitly penalizes bias through TCE.

\section{System Overview}
Our best system is a LoRA fine-tuned Qwen2.5-7B-Instruct model trained on a mix of original and symbolic training data. We target a low-parameter adaptation strategy to maximize reproducibility and efficiency, while explicitly reshaping the input distribution toward structural reasoning. The main challenges are (i) belief bias that favors plausibility over validity and (ii) limited training data (960 items). We address these by injecting symbolically augmented samples to decouple content from logic, and by using parameter-efficient adaptation to avoid overfitting.

\subsection{Key Design Decisions and System Pipeline}
We adopt QLoRA with attention-only adapters and greedy decoding to reduce memory while preserving instruction-following behavior. Our pipeline has three stages: (1) generate augmented training data with SDA; (2) fine-tune a LoRA adapter on the mixed corpus; and (3) run deterministic inference with the adapter and evaluate with the official script. This keeps training and inference aligned and enables clean ablations.

\subsection{Algorithmic Specification}
Let $x$ be a syllogism and $y \in \{\text{Valid}, \text{Invalid}\}$ its label. SDA defines a mapping $\phi(\cdot)$ that replaces content words with symbols while preserving quantifiers and negation. The training set becomes $\mathcal{D}^\prime = \mathcal{D} \cup \phi(\mathcal{D})$. We then optimize a LoRA-adapted model $\theta+\Delta$ by minimizing cross-entropy on $\mathcal{D}^\prime$.

\begin{algorithm}[t]
\caption{\textbf{LoRA + SDA Training}}
\begin{algorithmic}[1]
\Require Training set $\mathcal{D}$, base model $\theta$, symbol map $\phi$
\Ensure LoRA adapter $\Delta$
\State $\mathcal{D}_{\text{sda}} \leftarrow \{(\phi(x), y)\;|\;(x,y)\in\mathcal{D}\}$
\State $\mathcal{D}' \leftarrow \mathcal{D} \cup \mathcal{D}_{\text{sda}}$
\State Initialize LoRA adapters $\Delta$ on attention layers
\State Optimize $\theta+\Delta$ on $\mathcal{D}'$ with cross-entropy
\State Save $\Delta$ for inference
\end{algorithmic}
\end{algorithm}

\subsection{Symbolic Data Augmentation (SDA)}
We replace concrete entities with randomized symbolic placeholders (e.g., \emph{Wug}, \emph{Zarp}, \emph{A}, \emph{B}) while preserving logical form. This removes plausibility cues and prevents lexical overlap from being grounded in world knowledge. For example:
\begin{quote}
\small
Original: ``All dogs are mammals. No mammals are fish. Therefore, no dogs are fish.''\\
Augmented: ``All Wugs are Zips. No Zips are Mors. Therefore, no Wugs are Mors.''
\end{quote}
SDA is implemented with template-driven replacement of subject, predicate, and middle terms while preserving quantifiers and negation, and the augmented samples are mixed with the original training data.

\subsection{Concrete Example and Prompting Format}
Given ``All dogs are mammals. No mammals are fish. Therefore, no dogs are fish.'' (label \emph{Valid}), SDA produces ``All Wugs are Zips. No Zips are Mors. Therefore, no Wugs are Mors.'' Both forms are included in training; inference uses the original text and greedy decoding. We use a concise, instruction-style prompt that presents the syllogism and asks for a binary validity judgment. The output space is restricted to the two labels \emph{Valid} and \emph{Invalid}. We avoid chain-of-thought or intermediate rationales in the primary system because the LoRA adapter was trained to map directly from the full syllogism to the final label. This alignment between training and inference minimizes exposure bias and reduces variance in TCE.

\subsection{Why SDA Helps}
SDA reduces lexical priors and semantic anchoring by removing recognizable entities, forcing the model to rely on quantifiers and negation structure. In practice, we observed consistent TCE reductions when symbolic samples are included in training.

\subsection{Model and Training}
We fine-tune Qwen2.5-7B-Instruct using LoRA (QLoRA, 4-bit NF4). We adapt attention layers with a moderate rank (e.g., 16) and train on a mixture of original and symbolic samples. We use standard instruction-format prompts and train the model to output a single token (\emph{Valid}/\emph{Invalid}) without explicit reasoning. This ``direct intuition'' setup consistently yields the best balance between accuracy and bias. We observed significant sensitivity to random seed: seed 42 produced the most favorable accuracy--bias trade-off, while other seeds increased TCE.

We also explored alternative adaptation targets (e.g., all-linear layers) and higher ranks, but these increased capacity without improving bias robustness. In practice, attention-only LoRA with a moderate rank provided the most stable performance across random seeds and avoided overfitting to surface patterns in the training set.

\subsection{System Variants and Inference}
We evaluated: (i) \textbf{Baseline LoRA} (no SDA), (ii) \textbf{LoRA + SDA (primary)}, (iii) prompt-based variants (few-shot, structured CoT), and (iv) post-hoc variants (contrastive decoding, confidence filtering, DPO). The primary submission is (ii); others are ablations/negative results. Inference is performed with greedy decoding to avoid sampling noise. We found that increasing temperature or sampling multiple reasoning paths (self-consistency) consistently degraded the content-bias metric. The final system is therefore a single deterministic pass, which improves stability and reduces variance across runs.

\section{Experimental Setup}
We use the official English training set (960 items) and the official test set (191 items). The organizers do not provide a separate dev set; we therefore evaluate only on the official test split and do not re-partition the data. All ablations use the same test set. The augmented dataset is generated with \texttt{data\_augmentation.py} and combined with the original training data.

\subsection{Preprocessing}
We apply only the symbolic replacement pipeline described in Section 3 and keep tokenization unchanged. Quantifiers and negation markers are preserved verbatim. During training, we interleave original and augmented examples within each batch to avoid distribution shift between epochs.

\subsection{Hyperparameters and Tuning}
We use QLoRA with 4-bit NF4 quantization, attention-only LoRA adapters, and rank 16. We tune the symbolic--original mixing ratio to minimize TCE while retaining accuracy, fix random seed 42, and use a small learning rate with early stopping. Detailed hyperparameters and command lines are provided in the project archive.

For reproducibility, our training uses Qwen2.5-7B-Instruct with 4-bit NF4 quantization (double quantization enabled, BF16 compute) via \texttt{bitsandbytes}. We tokenize with the model's tokenizer and set \texttt{pad\_token = eos\_token}. We train for 3 epochs with learning rate $2\times10^{-4}$, per-device batch size 8, gradient accumulation 2 (effective batch size 16), max sequence length 512, and no packing. We use TRL \texttt{SFTTrainer} with BF16 and save one checkpoint per epoch. Our environment uses Python 3.12 on Ubuntu 22.04, PyTorch 2.8.0, and CUDA 12.8.

LoRA is applied to \texttt{q\_proj}, \texttt{k\_proj}, \texttt{v\_proj}, \texttt{o\_proj}, \texttt{gate\_proj}, \texttt{up\_proj}, and \texttt{down\_proj} with rank 16, \texttt{lora\_alpha} 32, and \texttt{lora\_dropout} 0.05. Each example is formatted as a three-turn chat: a system prompt that emphasizes logical validity over factual truth, a user prompt of the form ``Argument: <syllogism> Answer:'', and an assistant label of \emph{VALID} or \emph{INVALID}.

Training was performed on a single NVIDIA GTX 4090 GPU.

\begin{table*}[!t]
\centering
\small
\setlength{\tabcolsep}{5pt}
\begin{tabular}{lccc}
\toprule
\textbf{Method} & \textbf{ACC} & \textbf{TCE} & \textbf{Score} \\
\midrule
Qwen2.5-7B-Instruct (base) & 67.02 & 32.69 & 14.84 \\
Baseline LoRA (no SDA) & 95.81 & 3.12 & 39.64 \\
\textbf{LoRA + SDA (ours)} & 96.34 & 2.15 & 44.86 \\
Gemini-3-Pro (API only) & 99.48 & 1.06 & 57.68 \\
\textbf{Merged API + SDA (analysis)} & 100.00 & 0.00 & 100.00 \\
Few-shot prompting & 96.86 & 3.12 & 42.02 \\
Self-consistency (temp $>$ 0) & 95.3 & 3.5 & 39.1 \\
Contrastive decoding (alpha=1.0) & 89.0 & 16.6 & 23.0 \\
Structured CoT & 60.0 & 4.62 & 22.0 \\
DPO on same data & 95.81 & 2.15 & 44.62 \\
\bottomrule
\end{tabular}
\caption{Key ablations and results on the official test split (191 items). API and merged scores are reported for analysis and are not the primary submission. Numbers summarize representative runs from our logs.}
\end{table*}

\subsection{External Tools, Libraries, and Evaluation Measures}
Training and inference use \texttt{transformers}\footnote{\url{https://github.com/huggingface/transformers}}, \texttt{peft}\footnote{\url{https://github.com/huggingface/peft}}, \texttt{trl}\footnote{\url{https://github.com/huggingface/trl}}, and \texttt{torch}\footnote{\url{https://pytorch.org}}, with \texttt{bitsandbytes}\footnote{\url{https://github.com/TimDettmers/bitsandbytes}} for quantization. Versions follow the accompanying \texttt{requirements.txt}. We report overall accuracy (ACC) and total content effect (TCE). The official ranking metric is $\mathrm{ACC} / (1+\ln(1+\mathrm{TCE}))$, which rewards high accuracy while penalizing bias. We compute these using the official evaluation script.

\section{Results}
\subsection{Main Results}
Our best standalone model (LoRA+SDA) achieves 96.34\% accuracy and a TCE of 2.15, yielding a primary score of 44.86 under the official metric. These results represent the strongest trade-off between correctness and bias among methods that do not rely on external APIs. We did not record a shared-task leaderboard rank for this report. Compared with the unadapted Qwen2.5-7B-Instruct baseline (ACC 67.02, TCE 32.69, score 14.84), LoRA+SDA yields a large gain in accuracy while sharply reducing content effects, indicating that the improvement is not merely a calibration shift but a substantive reduction in bias-driven errors.

We also evaluated a Gemini-3-Pro API baseline on the test set, which achieved 99.48\% accuracy, TCE 1.06, and score 57.68. This provides a strong generalist reference point but still makes one error on the 191-item test set. The comparison highlights a complementary pattern: the API model is generally stronger but still fails on a small number of cases where the SDA specialist is correct.

\subsection{Ablation Analysis}
\paragraph*{Effect of symbolic augmentation.} Removing SDA (Baseline LoRA) reduces the score to 39.64 with higher TCE (3.12). This confirms that symbolic augmentation is the primary driver of bias reduction. The gains cannot be attributed to LoRA alone, since LoRA without SDA improves accuracy but leaves content effects relatively high.

\paragraph*{Implicit vs.\ explicit reasoning.} {\hyphenpenalty=10000\exhyphenpenalty=10000\emergencystretch=2em We find that explicit chains of thought (Mapping$\rightarrow$Structure$\rightarrow$Validity) degrade performance. On 7B models, long structured outputs add errors that overwhelm the final judgment. This points to a capacity mismatch between reasoning complexity and model size for this task, and to a training--inference mismatch when the model is trained on direct labels.}

\vspace{0.4\baselineskip}
\noindent\textbf{Self-consistency and sampling.} Majority voting across sampled paths (temperature $>$ 0) reduced accuracy and increased TCE. The model is already confident; sampling introduces spurious ``Invalid'' paths, harming both correctness and bias metrics. This suggests that stochastic decoding amplifies the base model's plausibility priors rather than revealing hidden correct paths.
\vspace{0.4\baselineskip}

\noindent\textbf{Contrastive decoding.} Subtracting base logits was intended to remove common-sense bias, but it also removed genuine logical competence present in the base model. This consistently reduced accuracy and worsened the score, indicating that the base model contributes useful reasoning signals alongside its biases.
\vspace{0.4\baselineskip}

\noindent\textbf{DPO on same data.} Training a DPO adapter on the same SFT data produced no gain: decision boundaries were already saturated. The resulting model retained TCE but lost small amounts of accuracy, consistent with mild overfitting and limited new supervision signal.

\paragraph*{Specialist--generalist complementarity.} The SDA specialist and the API model disagree on a small subset of cases (8/191). In those conflicts, the SDA model corrects the single API error, and a deterministic merge yields 100\% accuracy and TCE 0. We report this as diagnostic analysis, illustrating how a targeted specialist can complement a strong generalist.

Overall, these results suggest that direct LoRA training with SDA is the most robust and efficient strategy for this dataset, and that methods introducing longer intermediate reasoning or post-hoc logit manipulation are counterproductive at this model scale.

\subsection{Error Analysis}
We manually inspected errors from the best LoRA+SDA model. Most failures fall into two categories: (i) quantifier-scope confusion in multi-negation cases (e.g., \emph{``Not all A are B''} combined with \emph{``No B are C''}), and (ii) mismatches between existential and universal statements where validity depends on subtle logical form. These cases are precisely where belief bias and shallow heuristics are most likely to interfere. Because the error count is small, we do not report a confusion matrix; instead, we summarize error subtypes qualitatively.

\section{Conclusion}
We present a simple, robust system for Subtask 1: LoRA fine-tuning with symbolic data augmentation. The approach achieves high accuracy with low content bias, and ablations show that explicit reasoning, self-consistency, contrastive decoding, and DPO do not help at this model scale. SDA provides a practical path to decoupling content from validity; limitations include English-only coverage and potential lexical bias from augmentation. Our final submission uses the merged API + SDA strategy and attains a score of 100; among 45 teams, we are ranked 10th, with 11 teams tied at the top (all scoring 100). Looking ahead, we plan to extend SDA to multilingual settings and to probe failure cases involving quantifier scope and negation more systematically, with the goal of improving robustness without sacrificing efficiency. Looking ahead, we plan to extend SDA to multilingual settings and to probe failure cases involving quantifier scope and negation more systematically, with the goal of improving robustness without sacrificing efficiency.

\section*{Acknowledgments}
We thank the SemEval-2026 Task 11 organizers for the task design and evaluation resources.

\bibliography{references}

\end{document}