A Case Study of Efficacy and Challenges in Practical Human-in-Loop Evaluation of NLP Systems Using Checklist
Shaily Bhatt, Rahul Jain, Sandipan Dandapat, Sunayana Sitaram
Abstract
Despite state-of-the-art performance, NLP systems can be fragile in real-world situations. This is often due to an insufficient understanding of the capabilities and limitations of models and a heavy reliance on standard evaluation benchmarks. Research into non-standard evaluation to mitigate this brittleness is gaining increasing attention. Notably, the behavioral testing principle ‘Checklist’, which decouples testing from implementation, revealed significant failures in state-of-the-art models for multiple tasks. In this paper, we present a case study of using Checklist in a practical scenario. We conduct experiments for evaluating an offensive content detection system and use a data augmentation technique for improving the model using insights from Checklist. We lay out the challenges and open questions based on our observations of using Checklist for human-in-loop evaluation and improvement of NLP systems. Disclaimer: The paper contains examples of content with offensive language. The examples do not represent the views of the authors or their employers towards any person(s), group(s), practice(s), or entity/entities.
- Anthology ID:
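The abstract's notion of behavioral testing that "decouples testing from implementation" can be illustrated with a minimal sketch of a Checklist-style Minimum Functionality Test (MFT) for offensive-content detection. The keyword-based `predict` function below is a hypothetical stand-in, not the system evaluated in the paper; the test harness only depends on the predictor's input/output behavior, not its internals.

```python
# Minimal sketch of a Checklist-style Minimum Functionality Test (MFT).
# The predictor is a toy keyword matcher (an assumption for illustration);
# any model exposing the same text -> label interface could be swapped in.

OFFENSIVE_WORDS = {"idiot", "stupid"}

def predict(text: str) -> int:
    """Toy classifier: returns 1 (offensive) or 0 (not offensive)."""
    return int(any(word in text.lower() for word in OFFENSIVE_WORDS))

def run_mft(cases, expected):
    """Run an MFT: every generated case must receive the expected label."""
    failures = [case for case in cases if predict(case) != expected]
    return {"total": len(cases), "failed": len(failures), "failures": failures}

# Template-generated cases probing one capability (simple insults),
# mirroring how Checklist expands templates into many test instances.
names = ["Alex", "Maria"]
cases = [f"{name} is an idiot" for name in names]
report = run_mft(cases, expected=1)
# report["failed"] counts cases the predictor mislabels; a nonzero count
# flags a capability failure regardless of the model's benchmark scores.
```

Because the harness treats the model as a black box, the same test suite can be reused across implementations, which is the decoupling the abstract refers to.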
- 2021.humeval-1.14
- Volume:
- Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval)
- Month:
- April
- Year:
- 2021
- Address:
- Online
- Editors:
- Anya Belz, Shubham Agarwal, Yvette Graham, Ehud Reiter, Anastasia Shimorina
- Venue:
- HumEval
- Publisher:
- Association for Computational Linguistics
- Pages:
- 120–130
- URL:
- https://aclanthology.org/2021.humeval-1.14
- Cite (ACL):
- Shaily Bhatt, Rahul Jain, Sandipan Dandapat, and Sunayana Sitaram. 2021. A Case Study of Efficacy and Challenges in Practical Human-in-Loop Evaluation of NLP Systems Using Checklist. In Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval), pages 120–130, Online. Association for Computational Linguistics.
- Cite (Informal):
- A Case Study of Efficacy and Challenges in Practical Human-in-Loop Evaluation of NLP Systems Using Checklist (Bhatt et al., HumEval 2021)
- PDF:
- https://aclanthology.org/2021.humeval-1.14.pdf