2025
Overview of Dialog System Evaluation Track: Dimensionality, Language, Culture and Safety at DSTC 12
John Mendonça | Lining Zhang | Rahul Mallidi | Alon Lavie | Isabel Trancoso | Luis Fernando D’Haro | João Sedoc
Proceedings of the Twelfth Dialog System Technology Challenge
The rapid advancement of Large Language Models (LLMs) has intensified the need for robust dialogue system evaluation, yet comprehensive assessment remains challenging. Traditional metrics often prove insufficient, and safety considerations are frequently narrowly defined or culturally biased. The DSTC12 Track 1, “Dialog System Evaluation: Dimensionality, Language, Culture and Safety,” is part of the ongoing effort to address these critical gaps. The track comprised two subtasks: (1) Dialogue-level, Multi-dimensional Automatic Evaluation Metrics, and (2) Multilingual and Multicultural Safety Detection. For Task 1, focused on 10 dialogue dimensions, a Llama-3-8B baseline achieved the highest average Spearman’s correlation (0.1681), indicating substantial room for improvement. In Task 2, while participating teams significantly outperformed a Llama-Guard-3-1B baseline on the multilingual safety subset (top ROC-AUC 0.9648), the baseline proved superior on the cultural subset (0.5126 ROC-AUC), highlighting critical needs in culturally-aware safety. This paper describes the datasets and baselines provided to participants, as well as submission evaluation results for each of the two proposed subtasks.
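As a rough illustration of the two headline metrics reported above, the sketch below computes Spearman’s correlation (Task 1) and ROC-AUC (Task 2) with SciPy and scikit-learn on made-up scores; the arrays and probabilities are placeholders, not track data or the official evaluation script.

```python
# Illustrative computation of the DSTC12 Track 1 headline metrics.
# All values below are hypothetical, used only to show the metric calls.
from scipy.stats import spearmanr
from sklearn.metrics import roc_auc_score

# Task 1: correlate dialogue-level automatic scores with human ratings for
# one dimension; the track averages Spearman's rho over its 10 dimensions.
metric_scores = [0.62, 0.41, 0.88, 0.35, 0.73]   # hypothetical metric outputs
human_ratings = [4, 2, 5, 3, 4]                  # hypothetical human scores
rho, p_value = spearmanr(metric_scores, human_ratings)
print(f"Spearman's rho: {rho:.4f} (p={p_value:.3f})")

# Task 2: binary safety detection, evaluated with ROC-AUC.
safety_labels = [1, 0, 1, 1, 0]                  # 1 = unsafe, 0 = safe (hypothetical)
unsafe_probs = [0.91, 0.20, 0.75, 0.66, 0.35]    # hypothetical classifier probabilities
print(f"ROC-AUC: {roc_auc_score(safety_labels, unsafe_probs):.4f}")
```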
2024
The 2024 GEM Shared Task on Multilingual Data-to-Text Generation and Summarization: Overview and Preliminary Results
Simon Mille | João Sedoc | Yixin Liu | Elizabeth Clark | Agnes Johanna Axelsson | Miruna Adriana Clinciu | Yufang Hou | Saad Mahamood | Ishmael Nyunya Obonyo | Lining Zhang
Proceedings of the 17th International Natural Language Generation Conference: Generation Challenges
We present an overview of the GEM 2024 shared task, which comprised both data-to-text generation and summarization. New datasets were compiled specifically for the task to reduce data contamination in the large language models that participants were likely to use. The paper describes the tasks, the datasets, the participating systems, the evaluation methods, and some preliminary results. The full results will be presented at INLG ’24.
On the Role of Summary Content Units in Text Summarization Evaluation
Marcel Nawrath | Agnieszka Nowak | Tristan Ratz | Danilo Walenta | Juri Opitz | Leonardo Ribeiro | João Sedoc | Daniel Deutsch | Simon Mille | Yixin Liu | Sebastian Gehrmann | Lining Zhang | Saad Mahamood | Miruna Clinciu | Khyathi Chandu | Yufang Hou
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)
At the heart of the Pyramid evaluation method for text summarization lie human-written summary content units (SCUs). These SCUs are concise sentences that decompose a summary into small facts. Such SCUs can be used to judge the quality of a candidate summary, possibly partially automated via natural language inference (NLI) systems. Interestingly, with the aim of fully automating the Pyramid evaluation, Zhang and Bansal (2021) show that SCUs can be approximated by automatically generated semantic role triplets (STUs). However, several questions currently lack answers, in particular: i) Are there other ways of approximating SCUs that can offer advantages? ii) Under which conditions do SCUs (or their approximations) offer the most value? In this work, we examine two novel strategies to approximate SCUs: generating SCU approximations from AMR meaning representations (SMUs) and from large language models (SGUs), respectively. We find that while STUs and SMUs are competitive, the best approximation quality is achieved by SGUs. We also show, through a simple sentence-decomposition baseline (SSUs), that SCUs (and their approximations) offer the most value when ranking short summaries, but may not help as much when ranking systems or longer summaries.
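The partial automation mentioned above rests on an NLI step: checking whether a candidate summary entails a given content unit. A minimal sketch of that step is shown below with an off-the-shelf MNLI checkpoint; the example texts are invented and the checkpoint choice is an assumption, not one of the systems compared in the paper.

```python
# Sketch: score one SCU against a candidate summary with an NLI model.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

candidate_summary = "The city council approved the new budget on Tuesday."  # premise (hypothetical)
scu = "The budget was approved."                                            # one content unit (hypothetical)

inputs = tokenizer(candidate_summary, scu, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
probs = logits.softmax(dim=-1).squeeze()
# Label order for this checkpoint: [contradiction, neutral, entailment].
print(f"P(summary entails SCU) = {probs[2].item():.3f}")
```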
2023
A Needle in a Haystack: An Analysis of High-Agreement Workers on MTurk for Summarization
Lining Zhang | Simon Mille | Yufang Hou | Daniel Deutsch | Elizabeth Clark | Yixin Liu | Saad Mahamood | Sebastian Gehrmann | Miruna Clinciu | Khyathi Raghavi Chandu | João Sedoc
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
To prevent the costly and inefficient use of resources on low-quality annotations, we want a method for creating a pool of dependable annotators who can effectively complete difficult tasks, such as evaluating automatic summarization. Thus, we investigate the recruitment of high-quality Amazon Mechanical Turk workers via a two-step pipeline. We show that we can successfully filter out subpar workers before they carry out the evaluations and obtain high-agreement annotations with similar constraints on resources. Although our workers demonstrate a strong consensus among themselves and with CloudResearch workers, their alignment with expert judgments on a subset of the data is lower than expected, indicating that the workers need further training on correctness. This paper nevertheless serves as a best-practice guide for recruiting qualified annotators for other challenging annotation tasks.
Common Law Annotations: Investigating the Stability of Dialog System Output Annotations
Seunggun Lee | Alexandra DeLucia | Nikita Nangia | Praneeth Ganedi | Ryan Guan | Rubing Li | Britney Ngaw | Aditya Singhal | Shalaka Vaidya | Zijun Yuan | Lining Zhang | João Sedoc
Findings of the Association for Computational Linguistics: ACL 2023
Metrics for Inter-Annotator Agreement (IAA), like Cohen’s Kappa, are crucial for validating annotated datasets. Although high agreement is often used to show the reliability of annotation procedures, it is insufficient to ensure validity or reproducibility. While researchers are encouraged to increase annotator agreement, this can lead to highly specific and tailored annotation guidelines. We hypothesize that this may result in diverging annotations from different groups. To study this, we first propose the Lee et al. Protocol (LEAP), a standardized and codified annotation protocol. LEAP strictly enforces transparency in the annotation process, which ensures the reproducibility of annotation guidelines. Using LEAP to annotate a dialog dataset, we empirically show that while research groups may create reliable guidelines by raising agreement, this can cause divergent annotations across different research groups, thus questioning the validity of the annotations. Therefore, we caution NLP researchers against using reliability as a proxy for reproducibility and validity.
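Since the abstract turns on the distinction between agreement and validity, a minimal sketch of the IAA metric it references is given below, using scikit-learn’s Cohen’s kappa on hypothetical binary labels; the values are illustrative only, not LEAP annotations.

```python
# Chance-corrected inter-annotator agreement on hypothetical binary judgments.
from sklearn.metrics import cohen_kappa_score

annotator_a = [1, 0, 1, 1, 0, 1, 0, 0]  # hypothetical labels from annotator A
annotator_b = [1, 0, 1, 0, 0, 1, 1, 0]  # hypothetical labels from annotator B

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.3f}")  # value in [-1, 1]; 1 = perfect agreement
```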
2022
Probing GPT-3’s Linguistic Knowledge on Semantic Tasks
Lining Zhang | Mengchen Wang | Liben Chen | Wenxin Zhang
Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP
GPT-3 has attracted much attention from both academia and industry. However, it remains unclear what GPT-3 has understood or learned, especially in terms of linguistic knowledge. Some studies have shown that linguistic phenomena, including negation and tense, are hard for language models such as BERT to recognize. In this study, we conduct probing tasks focusing on semantic information. Specifically, we investigate GPT-3’s linguistic knowledge on semantic tasks that identify the tense, the number of subjects, and the number of objects in a given sentence. We also experiment with different prompt designs and decoding temperatures. Our experimental results suggest that GPT-3 has acquired the linguistic knowledge needed to identify certain semantic information in most cases, but still fails when certain types of disturbance occur in the sentence. We also perform an error analysis to summarize common types of mistakes that GPT-3 makes when dealing with certain semantic information.
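The probing setup described above reduces to templated prompts plus an exact-match check on the model’s answer. The sketch below illustrates one such tense probe; the prompt wording, example sentences, and the query_model placeholder are assumptions for illustration, not the paper’s exact design or any specific provider’s API.

```python
# Sketch of a zero-shot tense probe in the spirit of the study above.
def build_tense_prompt(sentence: str) -> str:
    # Hypothetical prompt template; the paper explores several designs.
    return (
        "Identify the tense of the following sentence. "
        "Answer with 'past' or 'present'.\n"
        f"Sentence: {sentence}\nAnswer:"
    )

def query_model(prompt: str, temperature: float = 0.0) -> str:
    # Placeholder: swap in a call to your LLM provider here.
    # Returning a fixed answer keeps the sketch runnable end to end.
    return "past"

probe_items = [  # hypothetical probe sentences with gold labels
    ("She walked to the store.", "past"),
    ("He writes a letter every day.", "present"),
]

correct = 0
for sentence, gold in probe_items:
    prediction = query_model(build_tense_prompt(sentence)).strip().lower()
    correct += int(prediction == gold)
print(f"Accuracy: {correct / len(probe_items):.2f}")
```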