@inproceedings{mendonca-etal-2024-benchmarking,
    title = "On the Benchmarking of {LLM}s for Open-Domain Dialogue Evaluation",
    author = "Mendon{\c{c}}a, John  and
      Lavie, Alon  and
      Trancoso, Isabel",
    editor = "Nouri, Elnaz  and
      Rastogi, Abhinav  and
      Spithourakis, Georgios  and
      Liu, Bing  and
      Chen, Yun-Nung  and
      Li, Yu  and
      Albalak, Alon  and
      Wakaki, Hiromi  and
      Papangelis, Alexandros",
    booktitle = "Proceedings of the 6th Workshop on NLP for Conversational AI (NLP4ConvAI 2024)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://preview.aclanthology.org/ingest-emnlp/2024.nlp4convai-1.1/",
    pages = "1--12",
    abstract = "Large Language Models (LLMs) have showcased remarkable capabilities in various Natural Language Processing tasks. For automatic open-domain dialogue evaluation in particular, LLMs have been seamlessly integrated into evaluation frameworks, and together with human evaluation, compose the backbone of most evaluations. However, existing evaluation benchmarks often rely on outdated datasets and evaluate aspects like Fluency and Relevance, which fail to adequately capture the capabilities and limitations of state-of-the-art chatbot models. This paper critically examines current evaluation benchmarks, highlighting that the use of older response generators and quality aspects fail to accurately reflect modern chatbot capabilities. A small annotation experiment on a recent LLM-generated dataset (SODA) reveals that LLM evaluators such as GPT-4 struggle to detect actual deficiencies in dialogues generated by current LLM chatbots."
}