IndicSteer: Inference-Time Safety Steering for Indic LLMs

Ruhaib Muhammad; Saahas Vijayalakshmi Rajaram; Suriya Priyan Durairaj

Abstract

Safety controls for Indic language generation must account for multilingual variation and culturally grounded harm categories that are underrepresented in English-centric resources. We present IndicSteer, an initial study of inference-time activation steering for safety across 8 harm categories and 9 Indic language settings, based on contrastive directions computed from safe/unsafe response pairs. To the best of our knowledge, this is the first application of Contrastive Activation Addition (CAA) to Indic LLMs. Evaluation uses a structured LLM-as-a-judge protocol with strict isolation by category and alpha, covering ≈12,960 prompt-response pairs. We report harmful-response and coherence metrics for Sarvam-1 and OpenHathi (Hindi track), and present cross-lingual representation structure via linear CKA for Sarvam-1 and Krutrim-2-Instruct. On matched slices, Sarvam-1 at 𝛼=12 reduces harmful rate from 73.47% to 41.34% (32.13 pp; 43.73% relative) with no additional retraining. For OpenHathi Hindi, harmful rate falls monotonically from 85.83% (baseline) to 27.13% at 𝛼=15, a 58.71 pp total reduction.
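The steering idea and the reported reductions in the abstract can be illustrated with a minimal sketch. It assumes the standard CAA recipe: a direction computed as the mean difference between activations on safe and unsafe responses, scaled by alpha and added to a hidden state at inference time. The function names, the toy 2-dimensional activations, and the choice of "safe minus unsafe" as the sign convention are illustrative assumptions, not the authors' implementation. Only the 73.47% and 41.34% figures come from the abstract.

```python
# Hypothetical sketch of contrastive activation steering on toy data.
# Real activations would come from a chosen transformer layer; here they
# are hand-written 2-dimensional lists so the arithmetic is easy to follow.

def contrastive_direction(safe_acts, unsafe_acts):
    """Per-dimension mean of safe activations minus mean of unsafe ones."""
    dims = range(len(safe_acts[0]))
    safe_mean = [sum(row[d] for row in safe_acts) / len(safe_acts) for d in dims]
    unsafe_mean = [sum(row[d] for row in unsafe_acts) / len(unsafe_acts) for d in dims]
    return [s - u for s, u in zip(safe_mean, unsafe_mean)]

def steer(hidden, direction, alpha):
    """Add the scaled direction to one hidden-state vector."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

def reduction(baseline_pct, steered_pct):
    """Return (absolute reduction in percentage points, relative reduction in %)."""
    absolute_pp = baseline_pct - steered_pct
    relative_pct = 100.0 * absolute_pp / baseline_pct
    return absolute_pp, relative_pct

# Toy activations (assumed values, for illustration only).
safe = [[1.0, 0.0], [3.0, 2.0]]
unsafe = [[0.0, 1.0], [0.0, 1.0]]

direction = contrastive_direction(safe, unsafe)
steered = steer([0.5, 0.5], direction, alpha=12)

# Harmful rates for Sarvam-1 as reported in the abstract.
pp, rel = reduction(73.47, 41.34)
print(round(pp, 2), round(rel, 2))  # 32.13 43.73
```

The last line reproduces the distinction the abstract draws between a 32.13 pp absolute drop and a 43.73% relative drop: the relative figure divides the absolute drop by the baseline rate.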

Anthology ID: 2026.stereacult-1.12
Volume: Proceedings of the 1st Workshop on Stereotypes Across Cultures in Language Technologies (StereACuLT 2026)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Weicheng Ma, Soroush Vosoughi, Nabeel Gillani, Rolando Coto-Solano
Venues: StereACuLT | WS
Publisher: Association for Computational Linguistics
Pages: 126–136
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.stereacult-1.12/
Cite (ACL): Ruhaib Muhammad, Saahas Vijayalakshmi Rajaram, and Suriya Priyan Durairaj. 2026. IndicSteer: Inference-Time Safety Steering for Indic LLMs. In Proceedings of the 1st Workshop on Stereotypes Across Cultures in Language Technologies (StereACuLT 2026), pages 126–136, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): IndicSteer: Inference-Time Safety Steering for Indic LLMs (Muhammad et al., StereACuLT 2026)
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.stereacult-1.12.pdf
