2025
Features that Make a Difference: Leveraging Gradients for Improved Dictionary Learning
Jeffrey Olmo | Jared Wilson | Max Forsey | Bryce Hepner | Thomas Vincent Howe | David Wingate
Findings of the Association for Computational Linguistics: NAACL 2025
Sparse Autoencoders (SAEs) are a promising approach for extracting neural network representations by learning a sparse and overcomplete decomposition of the network’s internal activations. However, SAEs are traditionally trained considering only activation values and not the effect those activations have on downstream computations. This limits the information available to learn features, and biases the autoencoder towards neglecting features which are represented with small activation values but strongly influence model outputs. To address this, we introduce Gradient SAEs (g-SAEs), which modify the k-sparse autoencoder architecture by augmenting the TopK activation function to rely on the gradients of the input activation when selecting the k elements. For a given sparsity level, g-SAEs produce reconstructions that are more faithful to original network performance when propagated through the network. Additionally, we find evidence that g-SAEs learn latents that are on average more effective at steering models in arbitrary contexts. By considering the downstream effects of activations, our approach leverages the dual nature of neural network features as both representations, retrospectively, and actions, prospectively. While previous methods have approached the problem of feature discovery primarily focused on the former aspect, g-SAEs represent a step towards accounting for the latter as well.
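The selection idea in this abstract can be illustrated with a toy sketch. The abstract does not give the exact scoring rule, so the rule below (pre-activation multiplied by the magnitude of a per-latent gradient weight) is an assumption made purely for illustration, and the names `plain_topk`, `gradient_weighted_topk`, and `grad_weights` are hypothetical rather than taken from the paper.

```python
def topk_indices(scores, k):
    """Return the indices of the k largest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def plain_topk(pre_acts, k):
    """Standard TopK: keep the k largest pre-activations, zero the rest."""
    keep = set(topk_indices(pre_acts, k))
    return [a if i in keep else 0.0 for i, a in enumerate(pre_acts)]


def gradient_weighted_topk(pre_acts, grad_weights, k):
    """Hypothetical gradient-aware TopK.

    Selection uses pre_act * |grad_weight| (an assumed scoring rule, not
    the paper's), but the kept entries retain their original values.
    """
    scores = [a * abs(g) for a, g in zip(pre_acts, grad_weights)]
    keep = set(topk_indices(scores, k))
    return [a if i in keep else 0.0 for i, a in enumerate(pre_acts)]


# Toy data: latent 2 has a small activation but a large downstream gradient.
pre_acts = [0.9, 0.8, 0.1, 0.05]
grad_weights = [0.1, 1.0, 10.0, 0.1]

plain = plain_topk(pre_acts, k=2)
weighted = gradient_weighted_topk(pre_acts, grad_weights, k=2)
```

In this toy case plain TopK keeps latents 0 and 1, while the gradient-weighted variant keeps latents 1 and 2, which mirrors the abstract's point that small but influential activations are otherwise neglected.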
2022
An Information-theoretic Approach to Prompt Engineering Without Ground Truth Labels
Taylor Sorensen | Joshua Robinson | Christopher Rytting | Alexander Shaw | Kyle Rogers | Alexia Delorey | Mahmoud Khalil | Nancy Fulda | David Wingate
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Pre-trained language models derive substantial linguistic and factual knowledge from the massive corpora on which they are trained, and prompt engineering seeks to align these models to specific tasks. Unfortunately, existing prompt engineering methods require significant amounts of labeled data, access to model parameters, or both. We introduce a new method for selecting prompt templates without labeled examples and without direct access to the model. Specifically, over a set of candidate templates, we choose the template that maximizes the mutual information between the input and the corresponding model output. Across 8 datasets representing 7 distinct NLP tasks, we show that when a template has high mutual information, it also has high accuracy on the task. On the largest model, selecting prompts with our method gets 90% of the way from the average prompt accuracy to the best prompt accuracy and requires no ground truth labels.
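The selection criterion in this abstract can be sketched as follows. The abstract states only that the chosen template maximizes mutual information between input and model output; the estimator used here, I(X;Y) = H(Y) - H(Y|X) computed from per-input output distributions, is an assumption for illustration, and the function names and toy distributions are hypothetical.

```python
import math


def entropy(dist):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def mutual_information(output_dists):
    """Estimate I(X;Y) = H(Y) - H(Y|X) from one output distribution per input.

    output_dists[i][j] is the model's probability of answer j for input i
    under a given template. No ground-truth labels are used.
    """
    n = len(output_dists)
    num_answers = len(output_dists[0])
    # Marginal over answers: average the per-input distributions.
    marginal = [sum(d[j] for d in output_dists) / n for j in range(num_answers)]
    # Conditional entropy: average entropy of each per-input distribution.
    conditional = sum(entropy(d) for d in output_dists) / n
    return entropy(marginal) - conditional


def select_template(dists_by_template):
    """Return the template name with the highest estimated mutual information."""
    return max(dists_by_template, key=lambda t: mutual_information(dists_by_template[t]))


# Toy data: template "A" gives confident, input-dependent answers;
# template "B" gives the same uninformative answer for every input.
candidates = {
    "A": [[0.9, 0.1], [0.1, 0.9]],
    "B": [[0.5, 0.5], [0.5, 0.5]],
}
best = select_template(candidates)
```

Under this estimator a template scores well when the model is confident on each input (low conditional entropy) while its answers vary across inputs (high marginal entropy), so template "A" is selected in the toy data.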
Prompt Compression and Contrastive Conditioning for Controllability and Toxicity Reduction in Language Models
David Wingate | Mohammad Shoeybi | Taylor Sorensen
Findings of the Association for Computational Linguistics: EMNLP 2022
We explore the idea of compressing the prompts used to condition language models, and show that compressed prompts can retain a substantive amount of information about the original prompt. For severely compressed prompts, while fine-grained information is lost, abstract information and general sentiments can be retained with surprisingly few parameters, which can be useful in the context of decode-time algorithms for controllability and toxicity reduction. We find that some complex prompts can be effectively compressed into a single token to guide generation. We also show that compressed prompts are largely compositional, and can be constructed such that they can be used to control independent aspects of generated text.