A Case Study of Efficacy and Challenges in Practical Human-in-Loop Evaluation of NLP Systems Using Checklist

Shaily Bhatt, Rahul Jain, Sandipan Dandapat, Sunayana Sitaram


Abstract
Despite state-of-the-art performance, NLP systems can be fragile in real-world situations. This is often due to insufficient understanding of the capabilities and limitations of models and the heavy reliance on standard evaluation benchmarks. Research into non-standard evaluation to mitigate this brittleness is gaining increasing attention. Notably, the behavioral testing principle ‘Checklist’, which decouples testing from implementation, revealed significant failures in state-of-the-art models for multiple tasks. In this paper, we present a case study of using Checklist in a practical scenario. We conduct experiments for evaluating an offensive content detection system and use a data augmentation technique to improve the model based on insights from Checklist. We lay out the challenges and open questions based on our observations of using Checklist for human-in-loop evaluation and improvement of NLP systems. Disclaimer: The paper contains examples of content with offensive language. The examples do not represent the views of the authors or their employers towards any person(s), group(s), practice(s), or entity/entities.
Anthology ID:
2021.humeval-1.14
Volume:
Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval)
Month:
April
Year:
2021
Address:
Online
Editors:
Anya Belz, Shubham Agarwal, Yvette Graham, Ehud Reiter, Anastasia Shimorina
Venue:
HumEval
Publisher:
Association for Computational Linguistics
Pages:
120–130
URL:
https://aclanthology.org/2021.humeval-1.14
Cite (ACL):
Shaily Bhatt, Rahul Jain, Sandipan Dandapat, and Sunayana Sitaram. 2021. A Case Study of Efficacy and Challenges in Practical Human-in-Loop Evaluation of NLP Systems Using Checklist. In Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval), pages 120–130, Online. Association for Computational Linguistics.
Cite (Informal):
A Case Study of Efficacy and Challenges in Practical Human-in-Loop Evaluation of NLP Systems Using Checklist (Bhatt et al., HumEval 2021)
PDF:
https://preview.aclanthology.org/nschneid-patch-5/2021.humeval-1.14.pdf
Video:
https://www.youtube.com/watch?v=fjkKVUZHJRQ
Video:
https://preview.aclanthology.org/nschneid-patch-5/2021.humeval-1.14.mp4