Sharan Maiya

2025

pdf bib abs
Improving Preference Extraction In LLMs By Identifying Latent Knowledge Through Classifying Probes
Sharan Maiya | Yinhong Liu | Ramit Debnath | Anna Korhonen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large Language Models (LLMs) are often used as automated judges to evaluate text, but their effectiveness can be hindered by various unintentional biases. We propose using linear classifying probes, trained by leveraging differences between contrasting pairs of prompts, to directly access LLMs’ latent knowledge and extract more accurate preferences. Through extensive experiments using models of varying size from four different families and six diverse datasets assessing text quality evaluation and common sense reasoning, we demonstrate that both supervised and unsupervised probing approaches consistently outperform traditional generation-based judgement while maintaining similar computational costs. These probes generalise under domain shifts and can even outperform finetuned evaluators with the same training data size. Our results suggest linear probing offers an accurate, robust and computationally efficient approach for LLM-as-judge tasks while providing interpretable insights into how models encode judgement-relevant knowledge. Our data and code will be openly released in the future.

2024

pdf bib abs
Cluster-Norm for Unsupervised Probing of Knowledge
Walter Laurito | Sharan Maiya | Grégoire Dhimoïla | Owen Ho Wan Yeung | Kaarel Hänni
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

The deployment of language models brings challenges in generating reliable text, especially when these models are fine-tuned with human preferences. To extract the encoded knowledge in these models without (potentially) biased human labels, unsupervised probing techniques like Contrast-Consistent Search (CCS) have been developed (Burns et al., 2022). However, salient but unrelated features in activation space can mislead these probes (Farquhar et al., 2023). Addressing this, we propose a cluster-normalization method to minimize the impact of such features by clustering and normalizing activations of contrast pairs before applying unsupervised probing techniques. While this approach does not address the issue of distinguishing between latent knowledge and that portrayed by a simulated agent—a major issue in the literature of eliciting latent knowledge (Paul Christiano and Xu, 2021)—it still significantly improves the accuracy of probes in identifying the intended knowledge amidst distractions.

Co-authors

Yinhong Liu 1

Owen Ho Wan Yeung 1

Venues

acl1
emnlp1

Fix author