Ruhaib Muhammad

2026

IndicSteer: Inference-Time Safety Steering for Indic LLMs
Ruhaib Muhammad | Saahas Vijayalakshmi Rajaram | Suriya Priyan Durairaj
Proceedings of the 1st Workshop on Stereotypes Across Cultures in Language Technologies (StereACuLT 2026)

Safety controls for Indic language generation must account for multilingual variation and culturally grounded harm categories that are underrepresented in English-centric resources. We present IndicSteer, an initial study of inference-time activation steering for safety across 8 harm categories and 9 Indic language settings, based on contrastive directions computed from safe/unsafe response pairs. To the best of our knowledge, this is the first application of Contrastive Activation Addition (CAA) to Indic LLMs. Evaluation uses a structured LLM-as-a-judge protocol with strict isolation by category and alpha, covering ≈12,960 prompt-response pairs. We report harmful-response and coherence metrics for Sarvam-1 and OpenHathi (Hindi track), and present cross-lingual representation structure via linear CKA for Sarvam-1 and Krutrim-2-Instruct. On matched slices, Sarvam-1 at 𝛼=12 reduces harmful rate from 73.47% to 41.34% (32.13 pp; 43.73% relative) with no additional retraining. For OpenHathi Hindi, harmful rate falls monotonically from 85.83% (baseline) to 27.13% at 𝛼=15, a 58.71 pp total reduction.
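The abstract names three computations: a contrastive direction from safe/unsafe pairs, steering scaled by α, and linear CKA between representations. The NumPy sketch below illustrates each under stated assumptions and is not the authors' implementation. The function names, the use of a plain mean difference (safe minus unsafe) without normalization, and the array shapes are all hypothetical; the abstract does not specify the layer, token position, or scaling used.

```python
import numpy as np


def contrastive_direction(safe_acts, unsafe_acts):
    """Assumed CAA-style direction: mean safe activation minus mean unsafe.

    Both inputs are (n_pairs, hidden_dim) arrays of activations taken at
    one layer (layer choice is not given in the abstract).
    """
    return safe_acts.mean(axis=0) - unsafe_acts.mean(axis=0)


def steer(hidden, direction, alpha):
    """Add the scaled direction to a hidden state at inference time."""
    return hidden + alpha * direction


def linear_cka(x, y):
    """Linear CKA between two (n_samples, dim) representation matrices."""
    # Center each feature column before comparing.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)


def reduction(baseline_pct, steered_pct):
    """Return (percentage-point drop, relative drop in percent)."""
    pp = baseline_pct - steered_pct
    return pp, 100.0 * pp / baseline_pct
```

Applied to the Sarvam-1 figures quoted above, `reduction(73.47, 41.34)` reproduces the reported 32.13 pp and 43.73% relative reduction. In a real model, `steer` would typically run inside a forward hook on the chosen layer; that wiring is omitted here because it depends on the model library.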

