Vipin Chaudhary


2026

Test-time scaling evaluates reasoning LLMs by sampling multiple outputs per prompt, but ranking models in this regime remains underexplored. We formalize dense benchmark ranking under test-time scaling and introduce Scorio, a library that implements statistical ranking methods such as paired-comparison models, item response theory (IRT) models, voting rules, and graph- and spectral-based methods. Across 20 reasoning models on four Olympiad-style math benchmarks (AIME’24, AIME’25, HMMT’25, and BrUMO’25; up to N = 80 trials), most full-trial rankings agree closely with the Bayesian gold standard Bayes_𝒰@80 (mean Kendall’s τ_b = 0.93–0.95), and 19–34 methods recover exactly the same ordering. In the single-trial regime, the best methods reach τ_b ≈ 0.86.Using greedy decoding as an empirical prior (Bayes_R₀@N) reduces variance at N = 1 by 16–52%, but can bias rankings when greedy and stochastic sampling disagree. These results identify reliable ranking methods for both high- and low-budget test-time scaling. We release Scorio as an open-source library at https://github.com/mohsenhariri/scorio.
Reasoning-capable large language models (LLMs) achieve strong performance on complex tasks but often exhibit overthinking after distillation, generating unnecessarily long chain-of-thought (CoT) reasoning even for simple inputs and incurring high inference cost. However, naively shortening reasoning length can degrade reasoning accuracy, as concise reasoning may be insufficient for certain inputs and lacks explicit supervision. We propose Auto Long-Short Reasoning (AutoL2S), a distillation framework that empowers non-reasoning LLMs to think thoroughly but only when necessary. AutoL2S first learns a lightweight switching token with verified long-short CoTs to enable instance-wise long-short reasoning selection. Then it leverages long-short reasoning rollouts induced by switching tokens within a GRPO-style loss to improve reasoning efficiency while maintaining accuracy. Experiments demonstrate that AutoL2S effectively reduces reasoning length up to 71% with minimal accuracy loss, yielding markedly better trade-off in token length and inference time while preserving accuracy. The code is available at https://github.com/amandaluof/AutoL2S.
Reasoning language models are controlled through explicit modes such as Think and No-think, yet we find that these behaviors are largely governed by a few token-level triggers rather than high-level instructions. Through attention analysis and controlled prompting experiments, we show that a leading “Okay” token induces reasoning behavior, while the newline pattern following ‘</think>‘ suppresses it. Based on this observation, we propose Mid-Think, a simple training-free prompting format that combines these triggers to achieve intermediate-budget reasoning, consistently outperforming fixed-token and prompt-based baselines in terms of the accuracy–length trade-off. Furthermore, applying Mid-Think to RL training after SFT reduces training time by approximately 15% while improving final performance of Qwen3-8B on AIME from 69.8% to 72.4% and on GPQA from 58.5% to 61.1%, demonstrating its effectiveness for both inference-time control and RL-based reasoning training.
Large Language Models (LLMs) suffer inference-time memory bottlenecks dominated by the attention Key-Value (KV) cache, which scales with model size and context length. While KV-cache quantization alleviates this cost, bit allocation between keys and values is often tuned heuristically, lacking theoretical grounding and generalizability. This paper proposes two theorems that anchor mixed-precision KV quantization in the intrinsic geometry of Transformer models. First, key weight matrices systematically have larger spectral and Frobenius norms than value matrices, implying higher information density along the key path. Second, for any given memory budget, prioritizing precision for keys over values strictly reduces quantization error and better preserves accuracy. Empirical evaluations across various prominent LLMs and benchmarks show that key-favored allocations (e.g., 4-bit keys, 2-bit values) retain up to 98.3% accuracy compared to uniform allocations (e.g., 4-bit for both), while conserving memory. These results transform bit allocation from ad hoc tuning into a theoretically grounded, geometry-driven design principle for efficient LLM inference. Source code is available at https://github.com/mohsenhariri/spectral-kv.

2025

Long-context capability is considered one of the most important abilities of LLMs, as a truly long context-capable LLM shall enable its users to effortlessly process many originally exhausting tasks — e.g., digesting a long-form document to find answers v.s., directly asking an LLM about it. However, existing real-task-based long-context evaluation benchmarks have a few major shortcomings. For instance, some Needle-in-a-Haystack-like benchmarks are too synthetic, and therefore do not represent the real world usage of LLMs. While some real-task-based benchmarks like LongBench avoid this problem, such benchmarks are often formed in a way where each data sample has a fixed sequence length, which not only makes them solely suitable for models with a certain range of context windows, but also lacks a proxy to know at what length the model/method-of-interest would fail. Last, most benchmarks tend to not provide proper metrics to separate long-context performance from the model’s baseline ability, so when conducting a cross-model/recipe comparison, such conflation makes the user unable to understand how exactly one model or recipe excels at the long-context task in relation to its baseline ability. To address these issues, we introduce a length-controllable, real-life reflective benchmark with a novel metric that disentangles baseline knowledge from long-context capabilities. Experiments demonstrate the superiority of our datasets in effectively evaluating LLMs. All assets are available at https://github.com/uservan/100-LongBench.git.
Backdoor attacks are powerful and effective, but distributing LLMs without a proven track record like ‘meta-llama‘ or ‘qwen‘ rarely gains community traction. We identify LoRA sharing as a unique scenario where users are more willing to try unendorsed assets, since such shared LoRAs allow them to enjoy personalized LLMs with negligible investment. However, this convenient share-and-play ecosystem also introduces a new attack surface, where attackers can distribute malicious LoRAs to an undefended community. Despite the high-risk potential, no prior art has comprehensively explored LoRA’s attack surface under the downstream-enhancing share-and-play context. In this paper, we investigate how backdoors can be injected into task-enhancing LoRAs and examine the mechanisms of such infections. We find that with a simple, efficient, yet specific recipe, **a backdoor LoRA can be trained once and then seamlessly merged (in a training-free fashion) with multiple task-enhancing LoRAs, retaining both its malicious backdoor and benign downstream capabilities.** This allows attackers to scale the distribution of compromised LoRAs with minimal effort by leveraging the rich pool of existing shared LoRA assets. We note that such merged LoRAs are particularly *infectious* — because their malicious intent is cleverly concealed behind improved downstream capabilities, creating a strong incentive for voluntary download — and *dangerous* — because under local deployment, no safety measures exist to intervene when things go wrong. Our work is among the first to study this new threat model of training-free distribution of downstream-capable-yet-backdoor-injected LoRAs, highlighting the urgent need for heightened security awareness in the LoRA ecosystem. **Warning: This paper contains offensive content and involves a real-life tragedy.**
Large language models (LLMs) have revolutionized natural language processing (NLP), particularly through Retrieval-Augmented Generation (RAG), which enhances LLM capabilities by integrating external knowledge. However, traditional RAG systems face critical limitations, including disrupted contextual integrity due to text chunking, and over-reliance on semantic similarity for retrieval. To address these issues, we propose CausalRAG, a novel framework that incorporates causal graphs into the retrieval process. By constructing and tracing causal relationships, CausalRAG preserves contextual continuity and improves retrieval precision, leading to more accurate and interpretable responses. We evaluate CausalRAG against regular RAG and graph-based RAG approaches, demonstrating its superiority across multiple metrics. Our findings suggest that grounding retrieval in causal reasoning provides a promising approach to knowledge-intensive tasks.

2024

Long context capability is a crucial competency for large language models (LLMs) as it mitigates the human struggle to digest long-form texts. This capability enables complex task-solving scenarios such as book summarization, code assistance, and many more tasks that are traditionally manpower-intensive. However, transformer-based LLMs face significant challenges with long context input due to the growing size of the KV cache and the intrinsic complexity of attending to extended inputs; where multiple schools of efficiency-driven approaches — such as KV cache quantization, token dropping, prompt compression, linear-time sequence models, and hybrid architectures — have been proposed to produce efficient yet long context-capable models. Despite these advancements, no existing work has comprehensively benchmarked these methods in a reasonably aligned environment. In this work, we fill this gap by providing a taxonomy of current methods and evaluating 10+ state-of-the-art approaches across seven categories of long context tasks. Our work reveals numerous previously unknown phenomena and offers insights — as well as a friendly workbench — for the future development of long context-capable LLMs. The source code is available at https://github.com/henryzhongsc/longctx_bench.
Ensuring the security of released large language models (LLMs) poses a significant dilemma, as existing mechanisms either compromise ownership rights or raise data privacy concerns. To address this dilemma, we introduce TaylorMLP to protect the ownership of released LLMs and prevent their abuse. Specifically, TaylorMLP preserves the ownership of LLMs by transforming the weights of LLMs into parameters of Taylor-series. Instead of releasing the original weights, developers can release the Taylor-series parameters with users, thereby ensuring the security of LLMs. Moreover, TaylorMLP can prevent abuse of LLMs by adjusting the generation speed. It can induce low-speed token generation for the protected LLMs by increasing the terms in the Taylor-series. This intentional delay helps LLM developers prevent potential large-scale unauthorized uses of their models. Empirical experiments across five datasets and three LLM architectures demonstrate that TaylorMLP induces over increase in latency, producing the tokens precisely matched with original LLMs. Subsequent defensive experiments further confirm that TaylorMLP effectively prevents users from reconstructing the weight values based on downstream datasets.

2019

Neural network models have shown promise in the temporal relation extraction task. In this paper, we present the attention based neural network model to extract the containment relations within sentences from clinical narratives. The attention mechanism used on top of GRU model outperforms the existing state-of-the-art neural network models on THYME corpus in intra-sentence temporal relation extraction.

2017

In this paper, we present MayoNLP’s results from the participation in the ScienceIE share task at SemEval 2017. We focused on the keyphrase classification task (Subtask B). We explored semantic similarities and patterns of keyphrases in scientific publications using pre-trained word embedding models. Word Embedding Distance Pattern, which uses the head noun word embedding to generate distance patterns based on labeled keyphrases, is proposed as an incremental feature set to enhance the conventional Named Entity Recognition feature sets. Support vector machine is used as the supervised classifier for keyphrase classification. Our system achieved an overall F1 score of 0.67 for keyphrase classification and 0.64 for keyphrase classification and relation detection.