Daichi Yamaga

2026

CuriosAI at SemEval-2026 Task 10: Hybrid approaches to conspiracy span extraction and conspiracy detection
Hiroki Takushima | Fumika Beppu | Aiswariya Manoj Kumar | Yuki Shibata | Takayuki Hori | Daichi Yamaga
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)

We present CuriosAI’s system for SemEval-2026 Task 10, addressing Conspiracy Marker Extraction and Conspiracy Detection. For marker extraction, we employ multi-label token classification with a bidirectional transformer (DeBERTa-v3-large) to predict overlapping spans. Alternative feature-based and LLM-based approaches do not surpass the encoder baseline. For Conspiracy Detection, we compare heterogeneous models, including transformer fine-tuning, lexical classifiers, embedding-based models, and LLM-based refinement. Development-optimal models do not always generalize best; logit-level ensembling achieves the strongest test performance (F1=0.7620). These results highlight the importance of bidirectional token modeling for span extraction and calibration-aware ensembling for robust detection.
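As a rough illustration of the logit-level ensembling mentioned in the abstract, the sketch below averages raw logits from several binary classifiers before applying a sigmoid and a threshold. The abstract does not state which models, weights, or threshold the authors combined, so the function names, the uniform default weights, and the 0.5 threshold are all assumptions made for this example.

```python
import math


def ensemble_logits(per_model_logits, weights=None):
    """Combine per-example logits from several models by a weighted mean.

    per_model_logits: list of lists, one inner list of logits per model.
    weights: optional per-model weights (assumed here to sum to 1);
             defaults to a uniform average.
    """
    n_models = len(per_model_logits)
    if weights is None:
        weights = [1.0 / n_models] * n_models
    n_examples = len(per_model_logits[0])
    return [
        sum(w * logits[i] for w, logits in zip(weights, per_model_logits))
        for i in range(n_examples)
    ]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(per_model_logits, threshold=0.5, weights=None):
    """Return 0/1 labels from the averaged logits (hypothetical threshold)."""
    fused = ensemble_logits(per_model_logits, weights)
    return [int(sigmoid(z) >= threshold) for z in fused]


# Three hypothetical models scoring two documents.
logits = [[2.0, -1.0], [0.0, -3.0], [-0.5, 1.0]]
labels = predict(logits)
```

Averaging in logit space rather than averaging hard labels lets a confident model outweigh uncertain ones, which is one plausible reason such an ensemble can generalize better than any single development-optimal model.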


CuriosAI at SemEval-2026 Task 4: A Comprehensive Study of Zero-Shot versus Fine-Tuned Approaches for Narrative Similarity
Yuki Shibata | Hiroki Takushima | Fumika Beppu | Aiswariya Manoj Kumar | Daichi Yamaga | Takayuki Hori
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)

This paper presents our system for SemEval-2026 Task 4 on narrative similarity assessment. Through comprehensive experimentation, we evaluated various approaches including zero-shot pre-trained models, prompt engineering with large language models, and multiple fine-tuning strategies using synthetic data. Our experiments revealed a surprising finding: pre-trained sentence transformers in a zero-shot setting consistently outperformed all fine-tuning attempts. Specifically, our best system using sentence-transformers/sentence-t5-xl achieved 67.5% accuracy on the development set (95% CI: [61.0%, 74.0%]), while all fine-tuning approaches resulted in performance degradation of 2-18 percentage points. We provide a detailed analysis of why fine-tuning failed and discuss the implications for narrative similarity tasks.
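A zero-shot embedding approach of the kind described above can be sketched as a cosine-similarity comparison. The example assumes, without confirmation from the abstract, that each item consists of an anchor story and two candidates and that the system picks the candidate closer to the anchor. The vectors are hand-written stand-ins for embeddings that a sentence encoder would produce, and `closer_candidate` is a hypothetical helper name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def closer_candidate(anchor, cand_a, cand_b):
    """Return "A" or "B" for whichever candidate embedding is nearer the anchor."""
    return "A" if cosine(anchor, cand_a) >= cosine(anchor, cand_b) else "B"


# Toy 2-d vectors standing in for real sentence embeddings (assumption).
anchor = [1.0, 0.0]
story_a = [0.9, 0.1]
story_b = [0.0, 1.0]
choice = closer_candidate(anchor, story_a, story_b)
```

No parameters are trained in this setup, which is what makes it zero-shot: the only moving part is the choice of pre-trained encoder.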


This paper proposes a method for predicting continuous emotion dimensions, namely Valence and Arousal, from text by combining affective intermediate training with multi-task learning. The proposed approach consists of two training phases: an intermediate pre-training phase using external emotion datasets, followed by a multi-task learning phase using task-specific data. RoBERTa-large is employed as the backbone model, and independent regression heads are introduced for each subtask. Experimental results show that the proposed method achieves Pearson correlation coefficients of 0.68 for Valence and 0.45 for Arousal on Subtask 1, demonstrating stable performance, particularly in capturing inter-user differences in emotional expression.
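For readers unfamiliar with the reported metric, the following is a minimal sketch of the Pearson correlation coefficient between predicted and gold scores. It is a plain textbook implementation, not the task's official scorer, and the sample values are invented for illustration.

```python
import math


def pearson_r(predictions, targets):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(predictions)
    if n != len(targets) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p = sum(predictions) / n
    mean_t = sum(targets) / n
    # Covariance and variances, all without the 1/n factor (it cancels).
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predictions, targets))
    var_p = sum((p - mean_p) ** 2 for p in predictions)
    var_t = sum((t - mean_t) ** 2 for t in targets)
    if var_p == 0 or var_t == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_t)


# Invented valence predictions and gold labels, for illustration only.
predicted_valence = [0.2, 0.5, 0.9, -0.3]
gold_valence = [0.1, 0.6, 0.8, -0.5]
r = pearson_r(predicted_valence, gold_valence)
```

Because the coefficient is invariant to shifting and scaling the predictions, it rewards getting the ordering and relative spacing of scores right rather than their absolute values.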


CuriosAI at SemEval-2026 Task 8: Hybrid retrieval system with repeated sampling for generation
Aiswariya Manoj Kumar | Hiroki Takushima | Fumika Beppu | Yuki Shibata | Daichi Yamaga | Takayuki Hori
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)

SemEval-2026 Task 8 (MTRAGEval) evaluates multi-turn Retrieval-Augmented Generation (RAG) under conversational challenges such as non-standalone turns, underspecification, and answerability detection. These conditions amplify retrieval and generation errors that standard single-turn RAG pipelines fail to address effectively. We present a robustness-oriented multi-turn RAG system combining contextual query rewriting, heterogeneous hybrid retrieval fused with Reciprocal Rank Fusion (RRF), domain-adaptive Low-Rank Adaptation (LoRA) reranking, and repeated sampling with metric-guided selection. On the official test set, our approach outperforms the organizers’ baselines across all subtasks: Retrieval (nDCG@5: 0.5396 vs. 0.4795), Generation (0.7571 vs. 0.6390), and RAG (0.5486 vs. 0.5366). Our system ranks 5th in Subtask A, 5th in Subtask B, and 7th in Subtask C on the official leaderboard. These results demonstrate that calibrated hybrid retrieval combined with robust generation selection is effective for multi-turn RAG.
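Reciprocal Rank Fusion, which the abstract names as the method for merging retriever outputs, scores each document by summing 1 / (k + rank) over every ranked list in which it appears. The sketch below is a generic version of that formula. The smoothing constant k = 60 is a commonly used default and an assumption here, as are the document IDs and the two example rankings; the abstract does not give the authors' settings.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    rankings: list of lists, each ordered best-first.
    k: smoothing constant (60 is a conventional default, assumed here).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by document ID.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs of a lexical and a dense retriever.
lexical = ["a", "b", "c"]
dense = ["b", "c", "d"]
fused = reciprocal_rank_fusion([lexical, dense])
```

Since RRF uses only ranks, it can combine retrievers whose raw scores live on incompatible scales, which suits a heterogeneous hybrid setup.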

Co-authors

Aiswariya Manoj 1

Venues

SemEval 4
WS 4
