Kaiyuan Zhang

2025

Accurately evaluating the word sense disambiguation (WSD) capabilities of large language models (LLMs) remains challenging, as existing studies primarily rely on single-task evaluations and classification-based metrics that overlook the fundamental differences between generative LLMs and traditional classification models. To bridge this gap, we proposeRoDEval, the first comprehensive evaluation framework specifically tailored for assessing LLM-based WSD methods. RoDEval introduces four novel metrics: Disambiguation Scope, Disambiguation Robustness, Disambiguation Reliability, and Definition Generation Quality Score, enabling a multifaceted evaluation of LLMs’ WSD capabilities. Experimental results using RoDEval across five mainstream LLMs uncover significant limitations in their WSD performance. Specifically, incorrect definition selections in multiple-choice WSD tasks stem not from simple neglect or forget of correct options, but rather from incomplete acquisition of the all senses for polysemous words. Instead, disambiguation reliability is often compromised by the models’ persistent overconfidence. In addition, inherent biases continue to affect performance, and scaling up model parameters alone fails to meaningfully enhance their ability to generate accurate sense definitions. These findings provide actionable insights for enhancing LLMs’ WSD capabilities. The source code and evaluation scripts are open-sourced at https://github.com/DayDream405/RoDEval.

Word sense disambiguation (WSD) is a fundamental yet challenging task in natural language processing. In recent years, the advent of large language models (LLMs) has led to significant advancements in regular WSD tasks. However, most existing LLMs face two major issues that hinder their performance in WSD. Firstly, these models are often prone to misclassifying the correct meaning of an ambiguous word when confronted with contexts containing adversarial information. Secondly, there is a lack of sufficient adversarial WSD datasets, which severely limits the development and evaluation of adversarial WSD systems. To address these gaps, we propose a novel Multi-Agent Debate framework for Adversarial Word Sense Disambiguation (MADAWSD). The MADAWSD framework simulates a real-world debate environment where multiple agent roles, namely, the Debater, Moderator, Consensus-seeker, and Judge, engage in discussions about ambiguous words in the context of adversarial information. Through a collaborative mechanism among these agents, it achieves accurate WSD. Additionally, a novel dataset for Chinese adversarial WSD has been constructed, focusing on improving and evaluating the performance of WSD models in the Chinese language. Extensive experiments on both English and Chinese adversarial WSD datasets demonstrate that MADAWSD can seamlessly integrate with existing LLMs and significantly enhance their performance, showcasing broad generality and outstanding effectiveness.

With the increasing capabilities of Large Language Models (LLMs), the proliferation of AI-generated texts has become a serious concern. Given the diverse range of organizations providing LLMs, it is crucial for governments and third-party entities to identify the origin LLM of a given AI-generated text to enable accurate mitigation of potential misuse and infringement. However, existing detection methods, primarily designed to distinguish between human-generated and LLM-generated texts, often fail to accurately identify the origin LLM due to the high similarity of AI-generated texts from different LLMs. In this paper, we propose a novel black-box AI-generated text origin detection method, dubbed Profiler, which accurately predicts the origin of an input text by extracting distinct context inference patterns through calculating and analyzing novel context losses between the surrogate model’s output logits and the adjacent input context. Extensive experimental results show that Profiler outperforms 10 state-of-the-art baselines, achieving more than a 25% increase in AUC score on average across both natural language and code datasets when evaluated against five of the latest commercial LLMs under both in-distribution and out-of-distribution settings.

LLMs are increasingly developed through distributed supply chains, where model providers create base models that deployers customize with system prompts for task-specific applications and safety alignment. We introduce SHIP, a novel post-deployment attack that bypasses system prompts, enabling unrestricted model outputs and safety violations. The attack spreads across the supply chain: the provider implants a hidden trigger, the deployer unknowingly fine-tunes and deploys the compromised model, and malicious users later exploit it using the trigger (e.g., obtained via underground market), as real-world software supply chain breaches. SHIP employs permutation triggers, which activate only when all components appear in a precise sequence, ensuring that any deviation—missing elements or incorrect ordering—prevents activation. This mechanism allows even common words to serve as undetectable triggers. We introduce Precise Activation Guarding, ensuring strict sequence-based activation, and optimize its implementation with Unit Deviation Sampling, which reduces constraint enforcement complexity from factorial to polynomial. Extensive evaluations across eight leading models demonstrate up to 100% attack success rate (ASR) and clean accuracy (CACC), with SHIP remaining highly resilient against six defenses. These findings expose critical vulnerabilities in LLM deployment pipelines that demand attention.