A Case Study of Efficacy and Challenges in Practical Human-in-Loop Evaluation of NLP Systems Using Checklist

Shaily Bhatt, Rahul Jain, Sandipan Dandapat, Sunayana Sitaram


Abstract
Despite state-of-the-art performance, NLP systems can be fragile in real-world situations. This is often due to insufficient understanding of the capabilities and limitations of models and the heavy reliance on standard evaluation benchmarks. Research into non-standard evaluation to mitigate this brittleness is gaining increasing attention. Notably, the behavioral testing principle ‘Checklist’, which decouples testing from implementation, revealed significant failures in state-of-the-art models for multiple tasks. In this paper, we present a case study of using Checklist in a practical scenario. We conduct experiments for evaluating an offensive content detection system and use a data augmentation technique to improve the model based on insights from Checklist. We lay out the challenges and open questions based on our observations of using Checklist for human-in-loop evaluation and improvement of NLP systems. Disclaimer: The paper contains examples of content with offensive language. The examples do not represent the views of the authors or their employers towards any person(s), group(s), practice(s), or entity/entities.
Anthology ID:
2021.humeval-1.14
Volume:
Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval)
Month:
April
Year:
2021
Address:
Online
Editors:
Anya Belz, Shubham Agarwal, Yvette Graham, Ehud Reiter, Anastasia Shimorina
Venue:
HumEval
Publisher:
Association for Computational Linguistics
Pages:
120–130
URL:
https://aclanthology.org/2021.humeval-1.14
Cite (ACL):
Shaily Bhatt, Rahul Jain, Sandipan Dandapat, and Sunayana Sitaram. 2021. A Case Study of Efficacy and Challenges in Practical Human-in-Loop Evaluation of NLP Systems Using Checklist. In Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval), pages 120–130, Online. Association for Computational Linguistics.
Cite (Informal):
A Case Study of Efficacy and Challenges in Practical Human-in-Loop Evaluation of NLP Systems Using Checklist (Bhatt et al., HumEval 2021)
PDF:
https://preview.aclanthology.org/nschneid-patch-5/2021.humeval-1.14.pdf
Video:
https://www.youtube.com/watch?v=fjkKVUZHJRQ
Video:
https://preview.aclanthology.org/nschneid-patch-5/2021.humeval-1.14.mp4