2023
Are Experts Needed? On Human Evaluation of Counselling Reflection Generation
Zixiu Wu | Simone Balloccu | Ehud Reiter | Rim Helaoui | Diego Reforgiato Recupero | Daniele Riboni
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Reflection is a crucial counselling skill where the therapist conveys to the client their interpretation of what the client said. Language models have recently been used to generate reflections automatically, but human evaluation is challenging, particularly due to the cost of hiring experts. Laypeople-based evaluation is less expensive and easier to scale, but its quality is unknown for reflections. Therefore, we explore whether laypeople can be an alternative to experts in evaluating a fundamental quality aspect: coherence and context-consistency. We do so by asking a group of laypeople and a group of experts to annotate both synthetic reflections and human reflections from actual therapists. We find that both laypeople and experts are reliable annotators and that they have moderate-to-strong inter-group correlation, which shows that laypeople can be trusted for such evaluations. We also discover that GPT-3 mostly produces coherent and consistent reflections, and we explore changes in evaluation results when the source of synthetic reflections changes to GPT-3 from the less powerful GPT-2.
2022
Towards In-Context Non-Expert Evaluation of Reflection Generation for Counselling Conversations
Zixiu Wu | Simone Balloccu | Rim Helaoui | Diego Reforgiato Recupero | Daniele Riboni
Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)
Reflection is an essential counselling strategy, where the therapist listens actively and responds with their own interpretation of the client’s words. Recent work leveraged pre-trained language models (PLMs) to approach reflection generation as a promising tool to aid counsellor training. However, those studies used limited dialogue context for modelling and simplistic error analysis for human evaluation. In this work, we take the first step towards addressing those limitations. First, we fine-tune PLMs on longer dialogue contexts for reflection generation. Then, we collect free-text error descriptions from non-experts about generated reflections, identify common patterns among them, and accordingly establish discrete error categories using thematic analysis. Based on this scheme, we plan, as future work, a large-scale non-expert error annotation phase for generated reflections, followed by an expert-based validation phase that examines “whether a coherent and consistent response is a good reflection”.
2021
Towards Low-Resource Real-Time Assessment of Empathy in Counselling
Zixiu Wu | Rim Helaoui | Diego Reforgiato Recupero | Daniele Riboni
Proceedings of the Seventh Workshop on Computational Linguistics and Clinical Psychology: Improving Access
Gauging therapist empathy in counselling is an important component of understanding counselling quality. While session-level empathy assessment based on machine learning has been investigated extensively, it relies on relatively large amounts of well-annotated dialogue data, and real-time evaluation has been overlooked in the past. In this paper, we focus on the task of low-resource utterance-level binary empathy assessment. We train deep learning models on heuristically constructed empathy vs. non-empathy contrast in general conversations, and apply the models directly to therapeutic dialogues, assuming correlation between empathy manifested in those two domains. We show that such training yields poor performance in general, probe its causes, and examine the actual effect of learning from empathy contrast in general conversation.
2019
Transformer-based Cascaded Multimodal Speech Translation
Zixiu Wu | Ozan Caglayan | Julia Ive | Josiah Wang | Lucia Specia
Proceedings of the 16th International Conference on Spoken Language Translation
This paper describes the cascaded multimodal speech translation systems developed by Imperial College London for the IWSLT 2019 evaluation campaign. The architecture consists of an automatic speech recognition (ASR) system followed by a Transformer-based multimodal machine translation (MMT) system. While the ASR component is identical across the experiments, the MMT model varies in how it integrates the visual context (simple conditioning vs. attention), the type of visual features exploited (pooled, convolutional, action categories), and the underlying architecture. For the latter, we explore both the canonical transformer and its deliberation version, with additive and cascade variants that differ in how they integrate the textual attention. After extensive experiments, we find that (i) the explored visual integration schemes often harm translation performance for the transformer and additive deliberation but considerably improve the cascade deliberation; (ii) the transformer and cascade deliberation integrate the visual modality better than the additive deliberation, as shown by the incongruence analysis.