Fan Yang
2026
FaithLM: Towards Faithful Explanations for Large Language Models
Yu-Neng Chuang | Guanchu Wang | Chia-Yuan Chang | Ruixiang Tang | Shaochen Zhong | Fan Yang | Andrew Wen | Mengnan Du | Xuanting Cai | Vladimir Braverman | Xia Hu
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) increasingly produce natural language explanations, yet these explanations often lack faithfulness: they do not reliably reflect the evidence the model actually uses to decide. We introduce FaithLM, a model-agnostic framework that evaluates and improves the faithfulness of LLM explanations without token masking or task-specific heuristics. FaithLM formalizes explanation faithfulness as an intervention property: a faithful explanation should yield a prediction shift when its content is contradicted. Theoretical analysis shows that the resulting contrary-hint score is a sound and discriminative estimator of faithfulness. Building on this principle, FaithLM iteratively refines both the elicitation prompt and the explanation to maximize the measured score. Experiments on three multi-domain datasets and multiple LLM backbones demonstrate that FaithLM consistently increases faithfulness and produces explanations more closely aligned with human rationales than strong self-explanation baselines. These findings highlight that intervention-based evaluation, coupled with iterative optimization, provides a principled route toward faithful and reliable LLM explanations.
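The intervention property above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `predict` is a hypothetical `prompt -> {label: prob}` interface, and the contrary hint here is a simple templated negation, whereas FaithLM derives its contrary hints with an LLM.

```python
def contrary_hint_score(predict, question, explanation, label):
    """Score how much contradicting an explanation shifts the prediction.

    If the explanation is faithful, telling the model its content is wrong
    should reduce the probability of the originally predicted label.
    """
    base = predict(question)[label]
    # Hypothetical contrary hint: assert that the explanation is incorrect.
    hint = f"Hint: the following reasoning is incorrect: {explanation}"
    intervened = predict(f"{hint}\n{question}")[label]
    return base - intervened  # larger shift => more faithful explanation
```

A larger score indicates the explanation carries evidence the model actually relies on; FaithLM then optimizes the prompt and explanation to maximize this quantity.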
Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering
Haiyan Zhao | Xuansheng Wu | Fan Yang | Bo Shen | Ninghao Liu | Mengnan Du
Findings of the Association for Computational Linguistics: EACL 2026
Linear concept vectors effectively steer LLMs, but existing methods suffer from noisy features in diverse datasets that undermine steering robustness. We propose Sparse Autoencoder-Denoised Concept Vectors (SDCV), which selectively keep the most discriminative SAE latents while reconstructing hidden representations. Our key insight is that concept-relevant signals can be explicitly separated from dataset noise by scaling up the activations of the top-k latents that best differentiate positive and negative samples. Applied to linear probing and difference-in-means steering, SDCV consistently improves steering success rates by 4-16% across six challenging concepts while maintaining topic relevance.
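The top-k denoising step can be sketched as follows. This is an illustrative toy with a linear encoder/decoder pair standing in for a trained sparse autoencoder; the latent-selection criterion and the `scale` factor are simplified assumptions, not the paper's exact recipe.

```python
import numpy as np

def denoised_concept_vector(h_pos, h_neg, W_enc, W_dec, k=2, scale=2.0):
    """Keep the k SAE latents that best separate positive and negative
    samples, amplify them, and reconstruct a denoised steering direction."""
    z_pos = h_pos @ W_enc                        # latents, positive samples
    z_neg = h_neg @ W_enc                        # latents, negative samples
    sep = np.abs(z_pos.mean(0) - z_neg.mean(0))  # per-latent separation
    top = np.argsort(sep)[-k:]                   # most discriminative latents
    gain = np.ones(W_enc.shape[1])
    gain[top] = scale                            # scale up selected latents
    recon = lambda h: (h @ W_enc) * gain @ W_dec # denoised reconstruction
    # Difference-in-means of the reconstructions gives the concept vector.
    return recon(h_pos).mean(0) - recon(h_neg).mean(0)
```

The resulting vector can then be added to hidden states at inference time to steer generation, as with ordinary linear concept vectors.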