Multi-Attribute Steering of Language Models via Targeted Intervention

Duy Nguyen; Archiki Prasad; Elias Stengel-Eskin; Mohit Bansal

Abstract

Inference-time intervention (ITI) has emerged as a promising method for steering large language model (LLM) behavior in a particular direction (e.g., improving helpfulness) by intervening on token representations without costly updates to the LLM’s parameters. However, existing ITI approaches fail to scale to multi-attribute settings with conflicts, such as enhancing helpfulness while also reducing toxicity. To address this, we introduce Multi-Attribute Targeted Steering (MAT-Steer), a novel steering framework designed for selective token-level intervention across multiple attributes. We achieve this by learning steering vectors using an alignment objective that shifts the model’s internal representations of undesirable outputs closer to those of desirable ones while enforcing sparsity and orthogonality among vectors for different attributes, thereby reducing inter-attribute conflicts. We evaluate MAT-Steer in two distinct settings: (i) on question answering (QA) tasks where we balance attributes like truthfulness, bias, and toxicity; (ii) on generative tasks where we simultaneously improve attributes like helpfulness, correctness, and coherence. MAT-Steer outperforms existing ITI and parameter-efficient fine-tuning approaches across both task types (e.g., average 3% accuracy gain across QA tasks and 55.82% win rate against the best ITI baseline).

Anthology ID: 2025.acl-long.1007
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 20619–20634
URL: https://preview.aclanthology.org/acl25-workshop-ingestion/2025.acl-long.1007/
Cite (ACL): Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, and Mohit Bansal. 2025. Multi-Attribute Steering of Language Models via Targeted Intervention. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 20619–20634, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Multi-Attribute Steering of Language Models via Targeted Intervention (Nguyen et al., ACL 2025)
PDF: https://preview.aclanthology.org/acl25-workshop-ingestion/2025.acl-long.1007.pdf
