Abstract
The evaluation of question answering models compares ground-truth annotations with model predictions. However, as of today, this comparison is mostly lexical-based and therefore misses out on answers that have no lexical overlap but are still semantically similar, thus treating correct answers as false. This underestimation of the true performance of models hinders user acceptance in applications and complicates a fair comparison of different models. Therefore, there is a need for an evaluation metric that is based on semantics instead of pure string similarity. In this short paper, we present SAS, a cross-encoder-based metric for the estimation of semantic answer similarity, and compare it to seven existing metrics. To this end, we create an English and a German three-way annotated evaluation dataset containing pairs of answers along with human judgment of their semantic similarity, which we release along with an implementation of the SAS metric and the experiments. We find that semantic similarity metrics based on recent transformer models correlate much better with human judgment than traditional lexical similarity metrics on our two newly created datasets and one dataset from related work.
- Anthology ID:
- 2021.mrqa-1.15
- Volume:
- Proceedings of the 3rd Workshop on Machine Reading for Question Answering
- Month:
- November
- Year:
- 2021
- Address:
- Punta Cana, Dominican Republic
- Editors:
- Adam Fisch, Alon Talmor, Danqi Chen, Eunsol Choi, Minjoon Seo, Patrick Lewis, Robin Jia, Sewon Min
- Venue:
- MRQA
- Publisher:
- Association for Computational Linguistics
- Pages:
- 149–157
- URL:
- https://aclanthology.org/2021.mrqa-1.15
- DOI:
- 10.18653/v1/2021.mrqa-1.15
- Cite (ACL):
- Julian Risch, Timo Möller, Julian Gutsch, and Malte Pietsch. 2021. Semantic Answer Similarity for Evaluating Question Answering Models. In Proceedings of the 3rd Workshop on Machine Reading for Question Answering, pages 149–157, Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Cite (Informal):
- Semantic Answer Similarity for Evaluating Question Answering Models (Risch et al., MRQA 2021)
- PDF:
- https://preview.aclanthology.org/naacl24-info/2021.mrqa-1.15.pdf
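
The abstract's point that lexical metrics treat semantically correct answers as false can be illustrated with a minimal sketch of two common lexical metrics, exact match and token-level F1. This is not the paper's SAS implementation: the function names `normalize`, `exact_match`, and `token_f1` are hypothetical, the normalization is a simplification (lowercasing and punctuation stripping only), and the answer pairs are made-up examples rather than items from the released datasets.

```python
import re
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, replace punctuation with spaces, and split into tokens."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def exact_match(prediction: str, truth: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def token_f1(prediction: str, truth: str) -> float:
    """Harmonic mean of token precision and recall (bag-of-words overlap)."""
    pred, gold = normalize(prediction), normalize(truth)
    if not pred or not gold:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == gold)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical pairs: (prediction, ground truth).
    pairs = [
        ("USA", "the United States"),   # same meaning, no shared tokens
        ("Obama", "Barack Obama"),      # partial lexical overlap
    ]
    for prediction, truth in pairs:
        print(prediction, "|", truth,
              "| EM:", exact_match(prediction, truth),
              "| F1:", round(token_f1(prediction, truth), 3))
```

In this sketch, the first pair scores zero on both metrics although the two answers refer to the same entity, which is the failure case a semantic metric is meant to address.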