Layer or Representation Space: What Makes BERT-based Evaluation Metrics Robust?

Doan Nam Long Vu, Nafise Sadat Moosavi, Steffen Eger


Abstract
Recent embedding-based evaluation metrics for text generation are assessed primarily by measuring their correlation with human evaluations on standard benchmarks. However, these benchmarks are mostly drawn from domains similar to those used for pretraining word embeddings. This raises concerns about the (lack of) generalization of embedding-based metrics to new and noisy domains whose vocabulary differs from that of the pretraining data. In this paper, we examine the robustness of BERTScore, one of the most popular embedding-based metrics for text generation. We show that (a) an embedding-based metric that has the highest correlation with human evaluations on a standard benchmark can have the lowest correlation when the amount of input noise or the number of unknown tokens increases, (b) taking embeddings from the first layer of pretrained models improves the robustness of all metrics, and (c) the highest robustness is achieved when using character-level embeddings, instead of token-based embeddings, from the first layer of the pretrained model.
Anthology ID:
2022.coling-1.300
Volume:
Proceedings of the 29th International Conference on Computational Linguistics
Month:
October
Year:
2022
Address:
Gyeongju, Republic of Korea
Editors:
Nicoletta Calzolari, Chu-Ren Huang, Hansaem Kim, James Pustejovsky, Leo Wanner, Key-Sun Choi, Pum-Mo Ryu, Hsin-Hsi Chen, Lucia Donatelli, Heng Ji, Sadao Kurohashi, Patrizia Paggio, Nianwen Xue, Seokhwan Kim, Younggyun Hahm, Zhong He, Tony Kyungil Lee, Enrico Santus, Francis Bond, Seung-Hoon Na
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
3401–3411
URL:
https://aclanthology.org/2022.coling-1.300
Cite (ACL):
Doan Nam Long Vu, Nafise Sadat Moosavi, and Steffen Eger. 2022. Layer or Representation Space: What Makes BERT-based Evaluation Metrics Robust? In Proceedings of the 29th International Conference on Computational Linguistics, pages 3401–3411, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Cite (Informal):
Layer or Representation Space: What Makes BERT-based Evaluation Metrics Robust? (Vu et al., COLING 2022)
PDF:
https://preview.aclanthology.org/emnlp-22-attachments/2022.coling-1.300.pdf
Code:
long21wt/robust-bert-based-metrics