Krish Sharma

Papers on this page may belong to the following people: Krish Sharma, Krish Sharma

2026

blue at SemEval-2026 Task 5: NarrBERT : Narrative-Aware BERT for Word Sense Disambiguation
Rhea Singhal | Krish Sharma | Lakksh Sharma | Jatin Bedi
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)

This paper outlines the method submitted by team blue for the SemEval-2026 Task 5: Rating Plausibility of Word Senses in Ambiguous Sentences through Narrative (AmbiStory). The task requires predicting reasonable scores that match human thoughts and judgments instead of just picking a single correct sense as the output. This means that contextual reasoning with fine-grained contextual modeling is vital. In order to tackle this problem, we suggest a BERT-based cross-encoder regression model. This model encodes the entire narrative context, which includes the precontext, the ambiguous sentence, and the ending, along with candidate sense definitions and example usages. Unlike bi-encoder sentence-level methods, our model allows for token-level interaction between story cues and sense meanings. This interaction helps capture subtle narrative disambiguation signals. We conduct a systematic exploration of model architectures and training strategies, progressing from a sentence-transformer baseline to an optimised BERT cross-encoder. On the development set, our best configuration achieves a Spearman rank correlation of 0.66. On the official test set, the system achieves a Spearman correlation of 0.4866 and an Accuracy-within-Standard-Deviation of 0.6613, substantially outperforming sentence-transformer bi-encoder baselines.

blue at SemEval-2026 Task 4: Synergizing Long-Context Reranking with Semantic Similarity for Narrative Alignment
Krish Sharma | Lakksh Sharma | Rhea Singhal | Jatin Bedi
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)

This paper describes the system submitted by team blue for SemEval-2026 Task 4: Narrative Story Similarity and Narrative Representation Learning, with a primary focus on the Pairwise Similarity subtask (Track A). The core challenge of this task lies in identifying deep structural alignments between stories, which is fundamentally hindered by the restricted context windows of standard transformer architectures that truncate narratives before reaching critical plot resolutions. To overcome this context bottleneck, we propose a hybrid ensemble architecture designed to capture extended narrative arcs. Our approach synergizes a cross-encoder (Jina Reranker v2), which processes long inputs via a sliding-window strategy over 1,024-token chunks, to evaluate the global "course of action," with a semantic bi-encoder (RoBERTa-Large) to validate local tonal consistency. This dual-stream system achieved a Pearson correlation score of 0.63, demonstrating that processing narrative content beyond the 512-token truncation boundary is strictly necessary for accurate pairwise narrative comparison.

Lakksh at SemEval-2026 Task 11(1 2): Neuro-Symbolic Decomposition to Mitigate Content Bias in Syllogistic Reasoning
Lakksh Sharma | Krish Sharma | Jatin Bedi
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)

Syllogistic reasoning is the ability to distinguish logical validity from semantic plausibility — a setting in which LLMs succumb to frequent content bias by conflating the two. The result is a characteristic failure to recognize logically valid arguments with highly implausible conclusions and logically invalid but semantically plausible arguments. This paper introduces a neuro-symbolic system that avoids this behavior by design: neural structure extraction is strictly separated from symbolic validity checking. A T5-Small parser is trained only on synthetic nonsense-symbol syllogisms, ensuring that the structural parse is learned in the absence of real-world semantics. Validity checking is performed by a deterministic symbolic kernel operating on extracted logical form alone, ensuring that semantic content cannot influence the final call. In binary validity classification, the system achieves 97.38% accuracy with a Total Content Effect of 3.10; in the retrieval setting, it achieves 82.11% accuracy with 99.47% F1 on premise identification. Ablation experiments show that formal theorem proving via NL-to-Z3 translation actually increases content bias due to leakage in intermediate representations. The results recommend architectural separation as a promising content-robustness strategy for syllogistic reasoning.

2025

DIMSUM: Discourse in Mathematical Reasoning as a Supervision Module
Krish Sharma | Niyar R Barman | Akshay Chaturvedi | Nicholas Asher
Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue

We look at reasoning on GSM8k, a dataset of short texts presenting primary school math problems. We find, with Mirzadeh et al. (2024), that current LLM progress on the data set may not be explained by better reasoning but by exposure to a broader pretraining data distribution. We then introduce a novel information source for helping models with less data or inferior training reason better: discourse structure. We show that discourse structure improves performance for models like Llama2 13b by up to 160%. Even for models that have most likely memorized the data set, adding discourse structural information to the model still improves predictions and dramatically improves large model performance on out-of-distribution examples.

2023

With the rise of prolific ChatGPT, the risk and consequences of AI-generated text have increased alarmingly. This triggered a series of events, including an open letter, signed by thousands of researchers and tech leaders in March 2023, demanding a six-month moratorium on the training of AI systems more sophisticated than GPT-4. To address the inevitable question of ownership attribution for AI-generated artifacts, the US Copyright Office released a statement stating that “if the content is traditional elements of authorship produced by a machine, the work lacks human authorship and the office will not register it for copyright”. Furthermore, both the US and the EU governments have recently drafted their initial proposals regarding the regulatory framework for AI. Given this cynosural spotlight on generative AI, AI-generated text detection (AGTD) has emerged as a topic that has already received immediate attention in research, with some initial methods having been proposed, soon followed by the emergence of techniques to bypass detection. This paper introduces the Counter Turing Test (CT2), a benchmark consisting of techniques aiming to offer a comprehensive evaluation of the robustness of existing AGTD techniques. Our empirical findings unequivocally highlight the fragility of the proposed AGTD methods under scrutiny. Amidst the extensive deliberations on policy-making for regulating AI development, it is of utmost importance to assess the detectability of content generated by LLMs. Thus, to establish a quantifiable spectrum facilitating the evaluation and ranking of LLMs according to their detectability levels, we propose the AI Detectability Index (ADI). We conduct a thorough examination of 15 contemporary LLMs, empirically demonstrating that larger LLMs tend to have a lower ADI, indicating they are less detectable compared to smaller LLMs. We firmly believe that ADI holds significant value as a tool for the wider NLP community, with the potential to serve as a rubric in AI-related policy-making.