Gokul Swamy


2026

When asked a question in a language less seen in its training data, current reasoning large language models (RLMs) often exhibit dramatically lower performance than when asked the same question in English. In response, we introduce (Self-Play with Privileged Pairwise Feedback), a two-stage framework for enhancing multilingual reasoning without any data in the target language(s). First, we supervise fine-tune (SFT) on translated versions of English question-answer pairs to raise base model correctness. Second, we perform RL with feedback from a pairwise judge in a self-play fashion, with the judge receiving the English reference response as privileged information. Thus, even when none of the model’s responses are completely correct, the privileged pairwise judge can still tell which response is better. End-to-end, greatly improves base model performance, even outperforming fully post-trained models on multiple math and non-math tasks with less than 1/8 of the training data across the single-language, multilingual, and generalization to unseen language settings.

2025

For an e-commerce domain, the customeraddress is the single most important pieceof customer data for ensuring accurateand reliable deliveries. In this two-partstudy, we first outline the construction ofa language model to assist customers withaddress standardization and in the latterpart, we detail a novel Pareto-ensemblemulti-task prediction algorithm that derives critical insights from customer addresses to minimize operational losses arising from a given geographical area. Finally, we demonstrate the potential benefits ofthe proposed address intelligence systemfor a large e-commerce domain throughlarge scale experiments on a commercialsystem.
Vision Language Models (VLMs) have achieved significant advancements in complex visual understanding tasks. However, VLMs are prone to hallucinations—generating outputs that lack alignment with visual content. This paper addresses hallucination detection in VLMs by leveraging the visual grounding information encoded in transformer attention maps. We identify three primary challenges in this approach: the elective nature of visual grounding for certain tokens, the high-dimensional and noisy nature of attention maps, and the dynamic sequence length of attention on previous tokens. To address these, we propose VADE, a novel sequence modelling approach to effectively learn complex sequential patterns from high-dimensional and noisy attention maps for fine-grained hallucination detection and mitigation. VADE achieves an average PR-AUC of 80% in hallucination detection on M-HalDetect across four different model architectures and an 5% improvement in hallucination mitigation on MSCOCO.