@inproceedings{ch-wang-etal-2024-androids,
title = "Do Androids Know They`re Only Dreaming of Electric Sheep?",
author = "CH-Wang, Sky and
Van Durme, Benjamin and
Eisner, Jason and
Kedzie, Chris",
editor = "Ku, Lun-Wei and
Martins, Andre and
Srikumar, Vivek",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2024",
month = aug,
year = "2024",
address = "Bangkok, Thailand",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/add-emnlp-2024-awards/2024.findings-acl.260/",
doi = "10.18653/v1/2024.findings-acl.260",
pages = "4401--4420",
abstract = "We design probes trained on the internal representations of a transformer language model to predict its hallucinatory behavior on three grounded generation tasks. To train the probes, we annotate for span-level hallucination on both sampled (organic) and manually edited (synthetic) reference outputs. Our probes are narrowly trained and we find that they are sensitive to their training domain: they generalize poorly from one task to another or from synthetic to organic hallucinations. However, on in-domain data, they can reliably detect hallucinations at many transformer layers, achieving 95{\%} of their peak performance as early as layer 4. Here, probing proves accurate for evaluating hallucination, outperforming several contemporary baselines and even surpassing an expert human annotator in response-level detection F1. Similarly, on span-level labeling, probes are on par or better than the expert annotator on two out of three generation tasks. Overall, we find that probing is a feasible and efficient alternative to language model hallucination evaluation when model states are available."
}
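
As a rough illustration of the probing setup described in the abstract, the Python sketch below trains a linear classifier on per-token hidden states to flag hallucinated spans. This is not the authors' code: the hidden states are random placeholders standing in for activations extracted from an intermediate transformer layer, and the probe architecture, hidden size, and labeling scheme are assumptions made only for illustration.

# Minimal sketch (not the paper's implementation): a linear probe over
# per-token hidden states that labels each token as hallucinated or grounded.
# In practice the features would come from a forward pass of the probed
# language model on grounded-generation outputs with span-level annotations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

HIDDEN_DIM = 768        # assumed hidden size of the probed model
N_TOKENS_TRAIN = 2000   # annotated tokens used to fit the probe
N_TOKENS_TEST = 500

# Placeholder "activations" and labels (1 = token inside a hallucinated span).
X_train = rng.normal(size=(N_TOKENS_TRAIN, HIDDEN_DIM))
y_train = rng.integers(0, 2, size=N_TOKENS_TRAIN)
X_test = rng.normal(size=(N_TOKENS_TEST, HIDDEN_DIM))
y_test = rng.integers(0, 2, size=N_TOKENS_TEST)

# The probe: a simple linear classifier trained on frozen hidden states.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

print("span-level F1:", f1_score(y_test, probe.predict(X_test)))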