Unsupervised Evaluation for Question Answering with Transformers

Lukas Muttenthaler, Isabelle Augenstein, Johannes Bjerva


Abstract
It is challenging to automatically evaluate the answer of a QA model at inference time. Although many models provide confidence scores, and simple heuristics can go a long way towards indicating answer correctness, such measures are heavily dataset-dependent and are unlikely to generalise. In this work, we begin by investigating the hidden representations of questions, answers, and contexts in transformer-based QA architectures. We observe a consistent pattern in the answer representations, which we show can be used to automatically evaluate whether or not a predicted answer span is correct. Our method does not require any labelled data and outperforms strong heuristic baselines, across 2 datasets and 7 domains. We are able to predict whether or not a model’s answer is correct with 91.37% accuracy on SQuAD, and 80.7% accuracy on SubjQA. We expect that this method will have broad applications, e.g., in semi-automatic development of QA datasets.
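The method rests on inspecting the hidden representations a transformer QA model produces for the predicted answer span. As a minimal sketch of that setup (not the authors' exact statistic), assuming a HuggingFace extractive QA checkpoint such as distilbert-base-cased-distilled-squad, the answer-span token vectors can be pulled from the model's hidden states like this:

```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Any extractive QA checkpoint works; this one is a common public example.
model_name = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(
    model_name, output_hidden_states=True
)
model.eval()

question = "Who wrote Hamlet?"
context = "Hamlet is a tragedy written by William Shakespeare around 1600."
inputs = tokenizer(question, context, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Predicted answer span from the start/end logits.
start = outputs.start_logits.argmax(dim=-1).item()
end = outputs.end_logits.argmax(dim=-1).item()

# hidden_states is a tuple with one (1, seq_len, dim) tensor per layer
# (plus the embedding layer); take the final layer here.
last_layer = outputs.hidden_states[-1][0]      # (seq_len, dim)
answer_vecs = last_layer[start : end + 1]      # answer-span token vectors
span_centroid = answer_vecs.mean(dim=0)

# One simple label-free signal (an illustrative choice, not the statistic
# from the paper): how tightly the answer tokens cluster around their
# centroid, measured by mean cosine similarity.
cohesion = torch.nn.functional.cosine_similarity(
    answer_vecs, span_centroid.unsqueeze(0), dim=-1
).mean()

answer = tokenizer.decode(inputs["input_ids"][0, start : end + 1])
print(answer, cohesion.item())
```

The paper derives its correctness predictor from patterns over such answer representations rather than from this illustrative cohesion score; the point of the sketch is only that the signal is computed from hidden states alone, with no labelled data.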
Anthology ID: 2020.blackboxnlp-1.8
Volume: Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP
Month: November
Year: 2020
Address: Online
Venue: BlackboxNLP
Publisher: Association for Computational Linguistics
Pages: 83–90
URL: https://aclanthology.org/2020.blackboxnlp-1.8
DOI: 10.18653/v1/2020.blackboxnlp-1.8
Cite (ACL): Lukas Muttenthaler, Isabelle Augenstein, and Johannes Bjerva. 2020. Unsupervised Evaluation for Question Answering with Transformers. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 83–90, Online. Association for Computational Linguistics.
Cite (Informal): Unsupervised Evaluation for Question Answering with Transformers (Muttenthaler et al., BlackboxNLP 2020)
PDF: https://preview.aclanthology.org/emnlp-22-attachments/2020.blackboxnlp-1.8.pdf
Data: SubjQA