Saran Krishnasamy

2026

GigitAI at SemEval-2026 Task 11: Hybrid Symbolic-Neural Approach for Syllogistic Validity Classification
Saran Krishnasamy
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)

We present our system for SemEval-2026 Task 11 on classifying whether syllogisms are logically valid. The main challenge is that language models tend to judge arguments based on whether the conclusion sounds true in the real world, rather than whether it follows logically from the premises. We evaluate direct prompting across six models (GPT-4o, GPT-5.2, o3, o3-mini, Claude Opus 4.6, Claude Sonnet 4) with three prompt strategies, finding that even the best achieves only 89.5% accuracy. Our best-performing system splits the task into two parts: GPT-4o-mini extracts the logical structure, then deterministic rules check validity, enhanced with bidirectional premise checking, predicate negation post-processing, and a targeted rule-based fallback for double negation. This achieves 98.95% accuracy on Subtask 1 (combined score 57.74) and 85.8% validity accuracy on Subtask 2. We also explore self-consistency with symbolic verification (93.1%), content abstraction, activation steering, contrastive fine-tuning, RLVR, and diffusion-based reasoning, finding that content abstraction surprisingly degrades performance, revealing that semantic content provides essential parsing scaffolding alongside the bias it introduces.
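
The abstract does not spell out the deterministic rules used once the logical structure has been extracted. As an illustration only, the sketch below checks validity by brute-force enumeration of Venn-diagram models under modern semantics (no existential import). The tuple encoding `(form, subject, predicate)`, the term names `S`, `M`, `P`, and the functions `holds` and `is_valid` are assumptions made for this example, not the authors' implementation.

```python
from itertools import product

# Hypothetical encoding: a statement is (form, X, Y) with form in {"A","E","I","O"}.
TERMS = ("S", "M", "P")
# Each region of the three-circle Venn diagram is a tuple of membership flags.
REGIONS = list(product([False, True], repeat=len(TERMS)))


def holds(statement, nonempty):
    """Evaluate one categorical statement in a model given by its non-empty regions."""
    form, x, y = statement
    i, j = TERMS.index(x), TERMS.index(y)
    some_x_and_y = any(r[i] and r[j] for r in nonempty)
    some_x_not_y = any(r[i] and not r[j] for r in nonempty)
    if form == "A":  # All X are Y
        return not some_x_not_y
    if form == "E":  # No X are Y
        return not some_x_and_y
    if form == "I":  # Some X are Y
        return some_x_and_y
    if form == "O":  # Some X are not Y
        return some_x_not_y
    raise ValueError(f"unknown form: {form}")


def is_valid(premises, conclusion):
    """Valid iff no model makes every premise true and the conclusion false."""
    for flags in product([False, True], repeat=len(REGIONS)):
        nonempty = [r for r, keep in zip(REGIONS, flags) if keep]
        if all(holds(p, nonempty) for p in premises) and not holds(conclusion, nonempty):
            return False  # counter-model found
    return True


# All M are P, all S are M, therefore all S are P.
barbara = is_valid([("A", "M", "P"), ("A", "S", "M")], ("A", "S", "P"))
# All P are M, all S are M, therefore all S are P (undistributed middle).
undistributed = is_valid([("A", "P", "M"), ("A", "S", "M")], ("A", "S", "P"))
```

Because the check depends only on the extracted forms and term positions, the real-world plausibility of the conclusion cannot influence the verdict, which is the motivation the abstract gives for separating extraction from verification.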

GigitAI at SemEval-2026 Task 8: Hybrid Sparse-Dense Retrieval and Zero-Shot Generation for Multi-Turn Conversational RAG
Saran Krishnasamy | Inez Wihardjo
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)

We describe our system for SemEval-2026 Task 8 (MTRAGEval) on multi-turn conversational RAG. Our approach combines hybrid retrieval (fusing SPLADE-v3 learned sparse representations with dense embeddings via Reciprocal Rank Fusion) with a fine-tuned cross-encoder reranker and zero-shot LLM generation using Claude Opus 4.5. We systematically evaluate 56 retrieval configurations across 4 domains, and 5 generation strategies across 5 LLMs. Our findings show that: (1) SPLADE-v3 with dataset rewrites substantially outperforms BM25 across all configurations, (2) simple zero-shot prompting matches sophisticated strategies like Self-RAG and CRAG, and (3) performance varies significantly by answerability class. On the test set, we achieve rank 5/29 on Task C (end-to-end RAG, H=0.5564), rank 7/26 on Task B (generation, H=0.7495), and rank 13/38 on Task A (retrieval, nDCG@5=0.4782). Our analysis reveals strong performance on answerable queries (H=0.685) but degradation on underspecified queries (H=0.254).
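
Reciprocal Rank Fusion, which the abstract names as the method for combining the sparse and dense rankings, can be sketched as follows. This is a minimal illustration: the smoothing constant `k=60` is a commonly used default rather than a value reported in the abstract, and the document IDs and the function name `reciprocal_rank_fusion` are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs; k=60 is an assumed default, not from the paper."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it contains.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical top-3 results from a sparse and a dense retriever.
sparse_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
```

The fusion uses only rank positions, so the sparse and dense scores do not need to be calibrated against each other before they are combined.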

2023

Little Giants: Exploring the Potential of Small LLMs as Evaluation Metrics in Summarization in the Eval4NLP 2023 Shared Task
Neema Kotonya | Saran Krishnasamy | Joel Tetreault | Alejandro Jaimes
Proceedings of the 4th Workshop on Evaluation and Comparison of NLP Systems

This paper describes and analyzes our participation in the 2023 Eval4NLP shared task, which focuses on assessing the effectiveness of prompt-based techniques to empower Large Language Models to handle the task of quality estimation, particularly in the context of evaluating machine translations and summaries. We conducted systematic experiments with various prompting techniques, including standard prompting, prompts informed by annotator instructions, and innovative chain-of-thought prompting. In addition, we integrated these approaches with zero-shot and one-shot learning methods to maximize the efficacy of our evaluation procedures. Our work reveals that combining these approaches using a “small”, open source model (orca_mini_v3_7B) yields competitive results.
