Yao Liu


2025

Enhancing LLM-based Hatred and Toxicity Detection with Meta-Toxic Knowledge Graph
Yibo Zhao | Jiapeng Zhu | Can Xu | Yao Liu | Xiang Li
Findings of the Association for Computational Linguistics: ACL 2025

The rapid growth of social media platforms has raised significant concerns regarding online content toxicity. When Large Language Models (LLMs) are used for toxicity detection, two key challenges emerge: 1) the absence of domain-specific toxicity knowledge leads to false negatives; 2) the excessive sensitivity of LLMs to toxic speech results in false positives, limiting freedom of speech. To address these issues, we propose a novel method called *MetaTox*, leveraging graph search on a meta-toxic knowledge graph to enhance hatred and toxicity detection. First, we construct a comprehensive meta-toxic knowledge graph by utilizing LLMs to extract toxic information through a three-step pipeline. Second, we query the graph via retrieval and ranking processes to supplement accurate, relevant toxicity knowledge. Extensive experiments and case studies across multiple datasets demonstrate that MetaTox boosts overall toxicity detection performance, particularly in out-of-domain settings. In addition, under in-domain scenarios, we surprisingly find that small language models are more competent. Our code is available at https://github.com/YiboZhao624/MetaTox.
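
The abstract outlines a retrieve-then-rank design; below is a rough, hypothetical sketch of how retrieved knowledge-graph triples might be ranked and folded into a detection prompt. All data structures and names are invented for illustration and are not taken from the MetaTox code.

```python
# Minimal sketch of retrieve-and-rank over a toy "meta-toxic" knowledge graph.
# All structures and names here are illustrative, not the actual MetaTox code.

TOXIC_KG = [
    # (subject, relation, object) triples standing in for mined toxicity knowledge
    ("slur_X", "targets", "group_Y"),
    ("phrase_Z", "implies", "dehumanization"),
    ("idiom_W", "is_benign_in", "regional_dialect"),
]

def retrieve(post: str, kg=TOXIC_KG):
    """Return triples whose subject token appears in the post."""
    tokens = set(post.lower().split())
    return [t for t in kg if t[0].lower() in tokens]

def rank(post: str, triples, top_k: int = 2):
    """Rank retrieved triples by crude lexical overlap with the post."""
    tokens = set(post.lower().split())
    scored = sorted(
        triples,
        key=lambda t: len(tokens & set(" ".join(t).lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_prompt(post: str) -> str:
    """Augment the detection prompt with retrieved toxicity knowledge."""
    facts = rank(post, retrieve(post))
    knowledge = "\n".join(f"- {s} {r} {o}" for s, r, o in facts) or "- (none found)"
    return (
        "Relevant toxicity knowledge:\n" + knowledge +
        f"\n\nPost: {post}\nIs this post toxic? Answer yes or no."
    )

print(build_prompt("that phrase_Z again, unbelievable"))
```

In the actual system, retrieval and ranking would operate over a graph built by the paper's three-step extraction pipeline rather than over a hardcoded triple list.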

GEMS: Generation-Based Event Argument Extraction via Multi-perspective Prompts and Ontology Steering
Run Lin | Yao Liu | Yanglei Gan | Yuxiang Cai | Tian Lan | Qiao Liu
Findings of the Association for Computational Linguistics: ACL 2025

Generative methods significantly advance event argument extraction by probabilistically generating event argument sequences in a structured format. However, existing approaches primarily rely on a single prompt to generate event arguments in a fixed, predetermined order. Such a rigid approach overlooks the complex structural and dynamic interdependencies among event arguments. In this work, we present GEMS, a multi-prompt learning framework that Generates Event arguments via Multi-perspective prompts and ontology Steering. Specifically, GEMS utilizes multiple unfilled prompts for each sentence, predicting event arguments in varying sequences to explicitly capture the interrelationships between arguments. These predictions are subsequently aggregated using a voting mechanism. Furthermore, an ontology-driven steering mechanism is proposed to ensure that the generated arguments are contextually appropriate and consistent with event-specific knowledge. Extensive experiments on two benchmark datasets demonstrate that GEMS achieves state-of-the-art performance, particularly in low-resource settings. The source code is available at: https://github.com/AONE-NLP/EAE-GEMS
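
To make the multi-perspective voting concrete, here is a minimal sketch of aggregating per-role predictions across prompt orders. The `generate` function is a stub standing in for the actual generative model, and the roles and canned outputs are invented for illustration.

```python
# Sketch of aggregating event-argument predictions from multiple prompt orders
# via majority voting. `generate` is a stub standing in for the seq2seq model.
from collections import Counter
from itertools import permutations

ROLES = ("Attacker", "Target", "Instrument")

def generate(sentence: str, role_order):
    """Stub generator: a real model would fill roles conditioned on the order.
    Here we return canned, slightly noisy predictions for illustration."""
    fake = {"Attacker": "hackers", "Target": "the server",
            "Instrument": "malware" if role_order[0] == "Attacker" else "a botnet"}
    return {role: fake[role] for role in role_order}

def extract_arguments(sentence: str):
    """Query one prompt per role permutation, then vote per role."""
    votes = {role: Counter() for role in ROLES}
    for order in permutations(ROLES):
        pred = generate(sentence, order)
        for role, span in pred.items():
            votes[role][span] += 1
    return {role: c.most_common(1)[0][0] for role, c in votes.items()}

print(extract_arguments("Hackers attacked the server with malware."))
```

With the stub above, the Instrument predictions disagree across orders, and the vote settles on the majority span; the ontology-steering step described in the abstract is omitted here.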

2024

InstructGraph: Boosting Large Language Models via Graph-centric Instruction Tuning and Preference Alignment
Jianing Wang | Junda Wu | Yupeng Hou | Yao Liu | Ming Gao | Julian McAuley
Findings of the Association for Computational Linguistics: ACL 2024

Do current large language models (LLMs) better solve graph reasoning and generation tasks with parameter updates? In this paper, we propose InstructGraph, a framework that empowers LLMs with the abilities of graph reasoning and generation by instruction tuning and preference alignment. Specifically, we first propose a structured format verbalizer to unify all graph data into a universal code-like format, which can represent the graph without any external graph-specific encoders. Furthermore, a graph instruction tuning stage is introduced to guide LLMs in solving graph reasoning and generation tasks. Finally, we identify potential hallucination problems in graph tasks and sample negative instances for preference alignment, with the goal of enhancing the reliability of the model's outputs. Extensive experiments across multiple graph-centric tasks show that InstructGraph achieves the best performance, outperforming GPT-4 and LLaMA2 by more than 13% and 38%, respectively.
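
As a hedged illustration of what a code-like graph verbalization might look like, here is a small sketch; the concrete syntax below is a guess at the flavor of such a format, not InstructGraph's published specification.

```python
# Sketch of a "structured format verbalizer": serialize a graph into a
# code-like string an LLM can consume without a graph encoder. The concrete
# syntax below is invented for illustration, not InstructGraph's exact format.

def verbalize_graph(name, nodes, edges) -> str:
    lines = [f'Graph[name="{name}"] {{']
    lines.append("    node_list = [" + ", ".join(f'"{n}"' for n in nodes) + "];")
    lines.append("    edge_list = [" + ", ".join(
        f'("{h}" -> "{t}", relation="{r}")' for h, r, t in edges) + "];")
    lines.append("}")
    return "\n".join(lines)

print(verbalize_graph(
    "citation",
    nodes=["paper_a", "paper_b"],
    edges=[("paper_a", "cites", "paper_b")],
))
```

The point of such a verbalizer is that a plain LLM can consume the string directly, with no graph-specific encoder in the loop.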

Cognitive Bias in Decision-Making with LLMs
Jessica Maria Echterhoff | Yao Liu | Abeer Alessa | Julian McAuley | Zexue He
Findings of the Association for Computational Linguistics: EMNLP 2024

Large language models (LLMs) offer significant potential as tools to support an expanding range of decision-making tasks. Given their training on human-created data, LLMs have been shown to inherit societal biases against protected groups, as well as to be subject to bias functionally resembling cognitive bias. Human-like bias can impede fair and explainable decisions made with LLM assistance. Our work introduces BiasBuster, a framework designed to uncover, evaluate, and mitigate cognitive bias in LLMs, particularly in high-stakes decision-making tasks. Inspired by prior research in psychology and cognitive science, we develop a dataset containing 13,465 prompts to evaluate LLM decisions on different cognitive biases (e.g., prompt-induced, sequential, inherent). We test various bias mitigation strategies, while proposing a novel method that uses LLMs to debias their own human-like cognitive bias within prompts. Our analysis provides a comprehensive picture of the presence and effects of cognitive bias across commercial and open-source models. We demonstrate that our self-help debiasing effectively mitigates model answers that display patterns akin to human cognitive bias, without having to manually craft examples for each bias.
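
A minimal sketch of the self-help debiasing loop described above: the model first rewrites the prompt to strip bias-inducing content, then answers the rewritten prompt. `call_llm` is a placeholder to be wired to any chat-completion API, and the rewrite template is an assumption, not the paper's exact prompt.

```python
# Sketch of "self-help" debiasing: the model first rewrites the question to
# strip bias-inducing content, then answers the neutral rewrite.
# `call_llm` is a placeholder; wire it to any chat-completion API.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model call here")

REWRITE_TEMPLATE = (
    "Rewrite the following question so that it contains no wording that could "
    "induce cognitive bias (e.g., anchoring numbers, loaded framing), while "
    "preserving the decision to be made.\n\nQuestion: {question}\n\nRewritten:"
)

def debiased_answer(question: str) -> str:
    # Step 1: the model neutralizes its own prompt.
    neutral = call_llm(REWRITE_TEMPLATE.format(question=question))
    # Step 2: the model answers the debiased prompt.
    return call_llm(neutral)
```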

2023

MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation
Zexue He | Yu Wang | An Yan | Yao Liu | Eric Chang | Amilcare Gentili | Julian McAuley | Chun-Nan Hsu
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Curated datasets for healthcare are often limited due to the need for human annotations from experts. In this paper, we present MedEval, a multi-level, multi-task, and multi-domain medical benchmark to facilitate the development of language models for healthcare. MedEval is comprehensive and consists of data from several healthcare systems, spanning 35 human body regions and 8 examination modalities. With 22,779 collected sentences and 21,228 reports, we provide expert annotations at multiple levels, enabling fine-grained use of the data and supporting a wide range of tasks. Moreover, we systematically evaluate 10 generic and domain-specific language models under zero-shot and fine-tuning settings, from domain-adapted baselines in healthcare to general-purpose state-of-the-art large language models (e.g., ChatGPT). Our evaluations reveal varying effectiveness of the two categories of language models across different tasks, highlighting the importance of instruction tuning for few-shot use of large language models. Our investigation paves the way toward benchmarking language models for healthcare and provides valuable insights into the strengths and limitations of adopting large language models in medical domains, informing their practical applications and future advancements.
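
For a sense of how such a benchmark is consumed, here is a generic zero-shot evaluation loop over a report-classification task; the dataset schema, the `predict` stub, and the accuracy metric are placeholders, not MedEval's actual interface.

```python
# Generic zero-shot evaluation loop over a medical report-classification task.
# The dataset schema and `predict` below are placeholders, not MedEval's API.

def predict(model, report: str, labels) -> str:
    """Stub: a real harness would prompt `model` with the report and label set."""
    return labels[0]

def evaluate_zero_shot(model, examples, labels) -> float:
    """examples: iterable of (report_text, gold_label) pairs; returns accuracy."""
    correct = 0
    total = 0
    for report, gold in examples:
        correct += predict(model, report, labels) == gold
        total += 1
    return correct / max(total, 1)

examples = [("Chest X-ray shows no acute findings.", "normal")]
print(evaluate_zero_shot(model=None, examples=examples, labels=["normal", "abnormal"]))
```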