Attribute Controlled Fine-tuning for Large Language Models: A Case Study on Detoxification
Tao Meng, Ninareh Mehrabi, Palash Goyal, Anil Ramakrishna, Aram Galstyan, Richard Zemel, Kai-Wei Chang, Rahul Gupta, Charith Peris
Abstract
We propose a constraint learning schema for fine-tuning Large Language Models (LLMs) with attribute control. Given a training corpus and control criteria formulated as a sequence-level constraint on model outputs, our method fine-tunes the LLM on the training corpus while enhancing constraint satisfaction with minimal impact on its utility and generation quality. Specifically, our approach regularizes the LLM training by penalizing the KL divergence between the desired output distribution, which satisfies the constraints, and the LLM's posterior. This regularization term can be approximated by an auxiliary model trained to decompose the sequence-level constraints into token-level guidance, allowing the term to be measured by a closed-form formulation. To further improve efficiency, we design a parallel scheme for concurrently updating both the LLM and the auxiliary model. We evaluate the empirical performance of our approach by controlling the toxicity when training an LLM. We show that our approach leads to an LLM that produces fewer inappropriate responses while achieving competitive performance on benchmarks and a toxicity detection task.
- Anthology ID:
- 2024.findings-emnlp.779
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2024
- Month:
- November
- Year:
- 2024
- Address:
- Miami, Florida, USA
- Editors:
- Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 13329–13341
- URL:
- https://preview.aclanthology.org/ingest_wac_2008/2024.findings-emnlp.779/
- DOI:
- 10.18653/v1/2024.findings-emnlp.779
- Cite (ACL):
- Tao Meng, Ninareh Mehrabi, Palash Goyal, Anil Ramakrishna, Aram Galstyan, Richard Zemel, Kai-Wei Chang, Rahul Gupta, and Charith Peris. 2024. Attribute Controlled Fine-tuning for Large Language Models: A Case Study on Detoxification. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 13329–13341, Miami, Florida, USA. Association for Computational Linguistics.
- Cite (Informal):
- Attribute Controlled Fine-tuning for Large Language Models: A Case Study on Detoxification (Meng et al., Findings 2024)
- PDF:
- https://preview.aclanthology.org/ingest_wac_2008/2024.findings-emnlp.779.pdf
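The abstract's core idea, a standard fine-tuning loss plus a KL penalty between the constraint-satisfying distribution and the model's posterior, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the toy dict-based distributions, and the way `token_guidance` reweights the model posterior are all illustrative assumptions; in the paper the guidance comes from a trained auxiliary model that decomposes the sequence-level constraint into token-level scores.

```python
import math

def kl_divergence(q, p):
    """KL(q || p) over a shared token vocabulary (dicts of probabilities)."""
    return sum(q[t] * math.log(q[t] / p[t]) for t in q if q[t] > 0)

def regularized_loss(nll, p_model, token_guidance, lam=1.0):
    """Illustrative per-step objective: standard negative log-likelihood
    plus a KL penalty pulling the model posterior toward a desired
    distribution q. Here q is formed by reweighting p_model with
    hypothetical token-level guidance scores (stand-ins for the
    auxiliary model's output) and renormalizing."""
    unnorm = {t: p_model[t] * token_guidance.get(t, 1.0) for t in p_model}
    z = sum(unnorm.values())
    q = {t: v / z for t, v in unnorm.items()}
    return nll + lam * kl_divergence(q, p_model)

# Toy posterior over two tokens; downweighting an undesired token
# (e.g. a toxic continuation) makes the penalty term positive.
p = {"safe": 0.5, "toxic": 0.5}
print(regularized_loss(2.0, p, {}))             # no guidance: q == p, KL = 0
print(regularized_loss(2.0, p, {"toxic": 0.1})) # guidance active: loss > nll
```

With uniform (absent) guidance the desired distribution coincides with the posterior, the KL term vanishes, and the objective reduces to plain fine-tuning, which matches the abstract's claim of minimal impact on utility when constraints are already satisfied.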