Lessons from a User Experience Evaluation of NLP Interfaces
Eduardo Calò | Lydia Penkert | Saad Mahamood
Findings of the Association for Computational Linguistics: NAACL 2025
Human evaluations lie at the heart of evaluation within the field of Natural Language Processing (NLP). Although regarded as the “gold standard” of evaluation, questions are being raised about whether these evaluations are reproducible and repeatable. One overlooked aspect is the design choices researchers make when building user interfaces (UIs). In this paper, UX experts assess four UIs used in past NLP human evaluations against standardized human-centered interaction principles. Building on these insights, we derive several recommendations that the NLP community should apply when designing UIs, to enable more consistent human evaluation responses.