2025
pdf
bib
abs
Evaluating LLMs’ Assessment of Mixed-Context Hallucination Through the Lens of Summarization
Siya Qi
|
Rui Cao
|
Yulan He
|
Zheng Yuan
Findings of the Association for Computational Linguistics: ACL 2025
With the rapid development of large language models (LLMs), LLM-as-a-judge has emerged as a widely adopted approach for text quality evaluation, including hallucination evaluation. While previous studies have focused exclusively on single-context evaluation (e.g., discourse faithfulness or world factuality), real-world hallucinations typically involve mixed contexts, which remains inadequately evaluated. In this study, we use summarization as a representative task to comprehensively evaluate LLMs’ capability in detecting mixed-context hallucinations, specifically distinguishing between factual and non-factual hallucinations. Through extensive experiments across direct generation and retrieval-based models of varying scales, our main observations are: (1) LLMs’ intrinsic knowledge introduces inherent biases in hallucination evaluation; (2) These biases particularly impact the detection of factual hallucinations, yielding a significant performance bottleneck; and (3) the fundamental challenge lies in effective knowledge utilization, balancing between LLMs’ intrinsic knowledge and external context for accurate mixed-context hallucination evaluation.
2024
pdf
bib
abs
Knowledge Generation for Zero-shot Knowledge-based VQA
Rui Cao
|
Jing Jiang
Findings of the Association for Computational Linguistics: EACL 2024
Previous solutions to knowledge-based visual question answering (K-VQA) retrieve knowledge from external knowledge bases and use supervised learning to train the K-VQA model.Recently pre-trained LLMs have been used as both a knowledge source and a zero-shot QA model for K-VQA and demonstrated promising results.However, these recent methods do not explicitly show the knowledge needed to answer the questions and thus lack interpretability.Inspired by recent work on knowledge generation from LLMs for text-based QA, in this work we propose and test a similar knowledge-generation-based K-VQA method, which first generates knowledge from an LLM and then incorporates the generated knowledge for K-VQA in a zero-shot manner. We evaluate our method on two K-VQA benchmarks and found that our method performs better than previous zero-shot K-VQA methods and our generated knowledge is generally relevant and helpful.
pdf
bib
abs
Recent Advances in Online Hate Speech Moderation: Multimodality and the Role of Large Models
Ming Shan Hee
|
Shivam Sharma
|
Rui Cao
|
Palash Nandi
|
Preslav Nakov
|
Tanmoy Chakraborty
|
Roy Ka-Wei Lee
Findings of the Association for Computational Linguistics: EMNLP 2024
Moderating hate speech (HS) in the evolving online landscape is a complex challenge, compounded by the multimodal nature of digital content. This survey examines recent advancements in HS moderation, focusing on the burgeoning role of large language models (LLMs) and large multimodal models (LMMs) in detecting, explaining, debiasing, and countering HS. We begin with a comprehensive analysis of current literature, uncovering how text, images, and audio interact to spread HS. The combination of these modalities adds complexity and subtlety to HS dissemination. We also identified research gaps, particularly in underrepresented languages and cultures, and highlight the need for solutions in low-resource settings. The survey concludes with future research directions, including novel AI methodologies, ethical AI governance, and the development of context-aware systems. This overview aims to inspire further research and foster collaboration towards responsible and human-centric approaches to HS moderation in the digital age.
2023
pdf
bib
abs
Modularized Zero-shot VQA with Pre-trained Models
Rui Cao
|
Jing Jiang
Findings of the Association for Computational Linguistics: ACL 2023
Large-scale pre-trained models (PTMs) show great zero-shot capabilities. In this paper, we study how to leverage them for zero-shot visual question answering (VQA).Our approach is motivated by a few observations. First, VQA questions often require multiple steps of reasoning, which is still a capability that most PTMs lack. Second, different steps in VQA reasoning chains require different skills such as object detection and relational reasoning, but a single PTM may not possess all these skills. Third, recent work on zero-shot VQA does not explicitly consider multi-step reasoning chains, which makes them less interpretable compared with a decomposition-based approach. We propose a modularized zero-shot network that explicitly decomposes questions into sub reasoning steps and is highly interpretable. We convert sub reasoning tasks to acceptable objectives of PTMs and assign tasks to proper PTMs without any adaptation. Our experiments on two VQA benchmarks under the zero-shot setting demonstrate the effectiveness of our method and better interpretability compared with several baselines.