Yi-Li Hsu

2026

Human evaluation plays a critical role in assessing the quality of generated text. However, the reliability and reproducibility of these evaluations depend on transparent and well-documented protocols—details that are frequently missing in current practice. In this work, we conduct a large-scale analysis of human evaluation protocols for evaluating long-form generation tasks in *CL conference publications from 2023–2025, including a full manual review of 284 papers and LLM-assisted analysis for another 1.8k+ papers. We define a set of 20 reportable criteria related to reproducibility of human evaluation studies, and apply these criteria to systematically examine reporting norms and practices within the community. We find widespread under-reporting of important aspects of human evaluation study design, leading to ambiguity about what was measured and how, who contributed judgments, and how judgments should be interpreted. Based on these findings, we outline actionable recommendations to support more transparent and reproducible reporting in future research.

As AI systems increasingly mediate everyday communication, large language models (LLMs) are expected not only to provide factually accurate responses but also to generate explanations that engage with users’ mental states. We build on the concept of cognitive chains—structured representations of Situation, Clue, Thought, Action, and Emotion inspired by Theory of Mind—to investigate whether conditioning LLM outputs on such belief chains improves explanation quality. Specifically, we evaluate explanations along six reader-perceived dimensions: overall quality, logical correctness, completeness, conciseness, empathy, and agreement. Prior work shows that LLM explanations often default to neutral or uncertain stances, while individuals holding strong false beliefs remain highly resistant to correction. To address this challenge, we instantiate cognitive chains from two perspectives: believers and non-believers of the news claims. Using GPT-4.1 as a role-player across these stances, we find that incorporating believers’ chains improves the perceived quality of explanations for audiences with misinformation-aligned beliefs. Our findings underscore the importance of modeling diverse mental states in explanation generation and provide the first systematic evidence that Theory-of-Mind–based cognitive chains enhance the persuasiveness of explanations in misinformation contexts.

2024

We introduce Enhancing Perception, a framework for Large Language Models (LLMs) designed to streamline the time-intensive task, typically undertaken by professional fact-checkers, of crafting explanations for fake news. This study investigates the effectiveness of enhancing LLM explanations through conversational refinement. We compare various questioner agents, including state-of-the-art LLMs such as GPT-4, Claude 2, and PaLM 2, as well as 193 American participants acting as human questioners. Based on the histories of these refinement conversations, we further generate comprehensive summary explanations. We evaluate the effectiveness of these initial, refined, and summary explanations across 40 news claims with 2,797 American participants, measuring their self-reported belief change regarding both real and fake claims after receiving the explanations. Our findings reveal that, in the context of fake news, explanations that have undergone conversational refinement (whether by GPT-4 or by human questioners, who ask more diverse and detail-oriented questions) were significantly more effective than both the initial unrefined explanations and the summary explanations. Moreover, these refined explanations achieved a level of effectiveness comparable to that of expert-written explanations. The results highlight the potential of automatic explanation refinement by LLMs in debunking fake news claims.

2023

Is Explanation the Cure? Misinformation Mitigation in the Short Term and Long Term
Yi-Li Hsu | Shih-Chieh Dai | Aiping Xiong | Lun-Wei Ku
Findings of the Association for Computational Linguistics: EMNLP 2023

With advancements in natural language processing (NLP) models, automatic explanation generation has been proposed to mitigate misinformation on social media platforms in addition to adding warning labels to identified fake news. While many researchers have focused on generating good explanations, how these explanations can really help humans combat fake news is under-explored. In this study, we compare the effectiveness of a warning label and the state-of-the-art counterfactual explanations generated by GPT-4 in debunking misinformation. In a two-wave, online human-subject study, participants (N = 215) were randomly assigned to a control group in which false content was shown without any intervention, a warning tag group in which the false claims were labeled, or an explanation group in which the false content was accompanied by GPT-4-generated explanations. Our results show that both interventions significantly decrease participants’ self-reported belief in fake claims in an equivalent manner in both the short term and the long term. We discuss the implications of our findings and directions for future NLP-based misinformation debunking strategies.

Label-Aware Hyperbolic Embeddings for Fine-grained Emotion Classification
Chih Yao Chen | Tun Min Hung | Yi-Li Hsu | Lun-Wei Ku
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Fine-grained emotion classification (FEC) is a challenging task. Specifically, FEC needs to handle subtle nuances between labels, which can be complex and confusing. Most existing models address the text classification problem only in Euclidean space, which we believe may not be the optimal solution, as semantically close labels (e.g., afraid and terrified) may not be differentiated in such a space, which harms performance. In this paper, we propose HypEmo, a novel framework that integrates hyperbolic embeddings to improve the FEC task. First, we learn label embeddings in hyperbolic space to better capture their hierarchical structure, and then our model projects contextualized representations into hyperbolic space to compute the distance between samples and labels. Experimental results show that using this distance to weight the cross-entropy loss substantially improves performance on two benchmark datasets, with around 3% improvement over the previous state of the art, and up to 8.6% when the labels are hard to distinguish. Code is available at https://github.com/dinobby/HypEmo.
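
The idea of weighting cross entropy by a hyperbolic distance can be illustrated with a minimal sketch. This is not the HypEmo implementation (see the linked repository for that). It assumes the Poincaré ball as the hyperbolic model and assumes the weight is simply the distance between a sample and its gold label; the function names `poincare_distance`, `cross_entropy`, and `distance_weighted_loss` are hypothetical and chosen only for illustration.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincare ball."""
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    sq_u = sum(a * a for a in u)
    sq_v = sum(b * b for b in v)
    # d(u, v) = arcosh(1 + 2 * ||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))
    denom = (1.0 - sq_u) * (1.0 - sq_v)
    return math.acosh(1.0 + 2.0 * sq_diff / denom)


def cross_entropy(logits, gold):
    """Standard softmax cross entropy for one example, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[gold]


def distance_weighted_loss(logits, gold, sample_point, label_points):
    """Cross entropy scaled by the hyperbolic distance to the gold label.

    Assumption for illustration: the weight is the raw distance, so samples
    that sit far from their gold label in the ball contribute more loss.
    """
    weight = poincare_distance(sample_point, label_points[gold])
    return weight * cross_entropy(logits, gold)


if __name__ == "__main__":
    # Two hypothetical label embeddings and one sample embedding in a 2-D ball.
    labels = [(0.5, 0.0), (0.0, 0.5)]
    sample = (0.0, 0.0)
    print(distance_weighted_loss([0.0, 0.0], 0, sample, labels))
```

In this sketch a sample that coincides with its gold label embedding receives zero weight, which a practical system would likely avoid, for example by adding a constant offset to the weight; that choice is left open here.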