Thommen George Karimpanal
2026
Improving Multilingual Language Models by Aligning Representations through Steering
Omar Mohamed Mahmoud | Buddhika Laknath Semage | Thommen George Karimpanal | Santu Rana
Proceedings of the Fifteenth Language Resources and Evaluation Conference
This paper investigates how Large Language Models (LLMs) represent non-English tokens—a question that remains underexplored despite recent progress. We propose a lightweight intervention method using representation steering, where a learned vector is added to the residual stream at a single model layer to enhance multilingual performance. Through extensive experiments across seven competitive baselines—including prompt optimization, supervised fine-tuning (SFT), in-context learning, cross-lingual transfer, projection mapping techniques, and translation-based methods—we show that our approach consistently outperforms most alternatives. In particular, it achieves performance on par with production-grade translation systems while requiring far fewer resources. We further explore the complementarity between our method and SFT, demonstrating that steering offers a direct, efficient way to realign internal representations. These findings underscore the potential of activation-level interventions as a powerful tool for improving the multilingual capabilities of LLMs.
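A minimal sketch of the kind of intervention the abstract describes: adding a vector to the residual stream at a single layer via a forward hook. The model (GPT-2 as a stand-in), layer index, steering strength, and the vector itself (random here, learned in the paper) are illustrative assumptions, not the authors' configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # stand-in for the multilingual LLM; assumption
LAYER_IDX = 6         # single intervention layer; hypothetical choice
ALPHA = 4.0           # steering strength; hypothetical choice

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

# In the paper the steering vector is learned; here a random placeholder.
steer_vec = torch.randn(model.config.hidden_size)
steer_vec = steer_vec / steer_vec.norm()

def steering_hook(module, inputs, output):
    # Decoder blocks return a tuple whose first element is the residual
    # stream activations of shape (batch, seq_len, hidden_size).
    hidden = output[0]
    hidden = hidden + ALPHA * steer_vec.to(hidden.dtype).to(hidden.device)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER_IDX].register_forward_hook(steering_hook)

prompt = "Translate to French: The weather is nice today."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()  # restore the unmodified model
```

Because the intervention is a single added vector at one layer, it touches no model weights and can be enabled or removed at inference time, which is what makes it so much cheaper than the SFT and translation baselines the abstract compares against.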
The Unintended Trade-off of AI Alignment: Balancing Hallucination Mitigation and Safety in LLMs
Omar Mahmoud | Ali Khalil | Thommen George Karimpanal | Buddhika Laknath Semage | Santu Rana
Findings of the Association for Computational Linguistics: EACL 2026
Hallucination in large language models (LLMs) has been widely studied in recent years, with progress in both detection and mitigation aimed at improving truthfulness. Yet, a critical side effect remains largely overlooked: enhancing truthfulness can negatively impact safety alignment. In this paper, we investigate this trade-off and show that increasing factual accuracy often comes at the cost of weakened refusal behavior. Our analysis reveals that this arises from overlapping components in the model that simultaneously encode hallucination and refusal information, leading alignment methods to suppress factual knowledge unintentionally. We further examine how fine-tuning on benign datasets, even when curated for safety, can degrade alignment for the same reason. To address this, we propose a method that disentangles refusal-related features from hallucination features using sparse autoencoders, and preserves refusal behavior during fine-tuning through subspace orthogonalization. This approach prevents hallucinations from increasing while maintaining safety alignment. We evaluate our method on commonsense reasoning tasks and harmfulness benchmarks (AdvBench and StrongReject). Results demonstrate that our approach preserves refusal behavior and task utility, mitigating the trade-off between truthfulness and safety.
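A minimal sketch of the subspace-orthogonalization step the abstract names: given directions spanning a refusal-related subspace, project each fine-tuning gradient onto the orthogonal complement so the update cannot move the parameters along those directions. The basis here is a random placeholder with hypothetical dimensions; in the paper the refusal directions come from sparse-autoencoder features.

```python
import torch

def orthogonalize_grad(grad: torch.Tensor, basis: torch.Tensor) -> torch.Tensor:
    """Remove the component of `grad` lying in span(basis).

    basis: (d, k) matrix with orthonormal columns spanning the
    refusal-related subspace (random placeholder here; the paper
    derives these directions from sparse-autoencoder features).
    """
    coeffs = basis.T @ grad        # (k,) coordinates inside the subspace
    return grad - basis @ coeffs   # component orthogonal to the subspace

# Toy demonstration with hypothetical dimensions.
d, k = 768, 8
basis, _ = torch.linalg.qr(torch.randn(d, k))  # orthonormalize k directions

grad = torch.randn(d)
safe_grad = orthogonalize_grad(grad, basis)

# The projected gradient has no component left in the refusal subspace:
print(torch.allclose(basis.T @ safe_grad, torch.zeros(k), atol=1e-5))  # True
```

In an actual fine-tuning loop, this projection would presumably be applied to the gradients (or weight updates) of the relevant layers before each optimizer step, so benign training signal passes through while movement along refusal-encoding directions is blocked.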