2025
AD-LLM: Benchmarking Large Language Models for Anomaly Detection
Tiankai Yang | Yi Nian | Li Li | Ruiyao Xu | Yuangang Li | Jiaqi Li | Zhuo Xiao | Xiyang Hu | Ryan A. Rossi | Kaize Ding | Xia Hu | Yue Zhao
Findings of the Association for Computational Linguistics: ACL 2025
Anomaly detection (AD) is an important machine learning task with many real-world uses, including fraud detection, medical diagnosis, and industrial monitoring. Within natural language processing (NLP), AD helps detect issues like spam, misinformation, and unusual user activity. Although large language models (LLMs) have had a strong impact on tasks such as text generation and summarization, their potential in AD has not been studied enough. This paper introduces AD-LLM, the first benchmark that evaluates how LLMs can help with NLP anomaly detection. We examine three key tasks: (i) zero-shot detection, using LLMs’ pre-trained knowledge to perform AD without task-specific training; (ii) data augmentation, generating synthetic data and category descriptions to improve AD models; and (iii) model selection, using LLMs to suggest unsupervised AD models. Through experiments with different datasets, we find that LLMs can work well in zero-shot AD, that carefully designed augmentation methods are useful, and that explaining model selection for specific datasets remains challenging. Based on these results, we outline six future research directions on LLMs for AD.
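To make the zero-shot setting concrete, here is a minimal sketch of LLM-based anomaly detection via prompting. It is not the benchmark’s actual code: the prompt wording and the `call_llm` hook are assumptions standing in for any chat-model API.

```python
# Minimal sketch of zero-shot AD via prompting (illustrative; the prompt
# wording and the `call_llm` provider hook are assumptions, not AD-LLM code).

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for any chat-model API call."""
    raise NotImplementedError("wire up an LLM provider here")

def zero_shot_detect(text: str, normal_categories: list[str]) -> bool:
    """Return True if the LLM flags `text` as anomalous."""
    prompt = (
        "You are an anomaly detector. Normal samples belong to these "
        f"categories: {', '.join(normal_categories)}.\n"
        f"Sample: {text}\n"
        "Does the sample belong to any normal category? "
        "Answer with exactly one word: 'normal' or 'anomaly'."
    )
    return call_llm(prompt).strip().lower().startswith("anomaly")
```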
Faithful, Unfaithful or Ambiguous? Multi-Agent Debate with Initial Stance for Summary Evaluation
Mahnaz Koupaee | Jake W. Vincent | Saab Mansour | Igor Shalyminov | Han He | Hwanjun Song | Raphael Shu | Jianfeng He | Yi Nian | Amy Wing-mei Wong | Kyu J. Han | Hang Su
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Faithfulness evaluators based on Large Language Models (LLMs) are often fooled by the fluency of the text and struggle to identify errors in summaries, usually leading to a high false negative rate. We propose an approach to summary faithfulness evaluation in which multiple LLM-based agents are assigned initial stances (regardless of what their actual beliefs might be) and forced to come up with a reason to justify the imposed belief, thus engaging in a multi-round debate to reach an agreement. The uniformly distributed initial assignments result in a greater diversity of stances, leading to more meaningful debates and, ultimately, more errors identified. Furthermore, by analyzing recent faithfulness evaluation datasets, we observe that a summary is not always clearly either faithful or unfaithful to its source document. We therefore introduce a new dimension, ambiguity, and a detailed taxonomy for identifying such special cases. Experiments demonstrate that our approach can help identify ambiguities and achieves even stronger performance on non-ambiguous summaries.
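A minimal sketch of the debate protocol described above follows, assuming a hypothetical `call_llm` hook; the prompts, agent count, and majority-vote aggregation are illustrative choices, not the authors’ implementation.

```python
# Sketch of multi-agent debate with imposed initial stances (illustrative;
# prompts and the voting rule are assumptions, not the paper's code).
from collections import Counter

STANCES = ["faithful", "unfaithful", "ambiguous"]

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for any chat-model API call."""
    raise NotImplementedError

def debate_verdict(document: str, summary: str,
                   n_agents: int = 3, n_rounds: int = 2) -> str:
    # Uniformly distribute the imposed stances across agents.
    stances = [STANCES[i % len(STANCES)] for i in range(n_agents)]
    transcript = [
        f"Agent {i} ({s}): " + call_llm(
            f"Document:\n{document}\n\nSummary:\n{summary}\n\n"
            f"You must argue that the summary is {s}; cite evidence."
        )
        for i, s in enumerate(stances)
    ]
    votes: list[str] = []
    for _ in range(n_rounds):
        votes = []
        for i in range(n_agents):
            reply = call_llm(
                "Debate so far:\n" + "\n".join(transcript) +
                "\n\nRespond to the other agents, then end with your current "
                "verdict as one word: faithful, unfaithful, or ambiguous."
            )
            transcript.append(f"Agent {i}: {reply}")
            words = [w.strip(".,!") for w in reply.lower().split()]
            votes.append(next((w for w in reversed(words) if w in STANCES),
                              "ambiguous"))
    return Counter(votes).most_common(1)[0][0]  # majority verdict
```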
The Automated but Risky Game: Modeling Agent-to-Agent Negotiations and Transactions in Consumer Markets
Shenzhe Zhu | Jiao Sun | Yi Nian | Tobin South | Alex Pentland | Jiaxin Pei
Proceedings of the Natural Legal Language Processing Workshop 2025
AI agents are increasingly used in consumer-facing applications to assist with tasks such as product search, negotiation, and transaction execution. In this paper, we investigate a future setting where both consumers and merchants authorize AI agents to automate negotiations and transactions in consumer settings. We aim to address two questions: (1) Do different LLM agents vary in performance when making deals on behalf of their users? (2) What are the potential risks of using AI agents to fully automate negotiations and deal-making in consumer settings? We designed an experimental framework to evaluate AI agents’ capabilities and performance in real-world negotiation and transaction scenarios, and experimented with a range of open-source and closed-source LLMs. Our analysis reveals that deal-making with LLM agents in consumer settings is an inherently imbalanced game: different AI agents show large disparities in obtaining the best deals for their users. Furthermore, we found that behavioral anomalies in LLMs can lead to financial losses when they are deployed in real-world decision-making scenarios, for example through overspending or accepting unreasonable deals. Our findings highlight that while automation can enhance transactional efficiency, it also poses nontrivial risks to consumer markets. Users should be careful when delegating business decisions to LLM agents.
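As a rough illustration of the setting, here is a sketch of an agent-to-agent negotiation loop with a budget guard that flags the overspending risk described above. Everything in it is an assumption: the prompts, the OFFER/ACCEPT protocol, and the `call_llm` hook are not the paper’s actual framework.

```python
# Sketch of agent-to-agent negotiation with a budget guard (illustrative;
# prompts, the OFFER/ACCEPT protocol, and `call_llm` are assumptions).

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for any chat-model API call."""
    raise NotImplementedError

def parse_last_offer(history: list[str]) -> float | None:
    """Extract the most recent 'OFFER <price>' from the transcript."""
    for line in reversed(history):
        parts = line.upper().split("OFFER", 1)
        if len(parts) == 2 and parts[1].split():
            try:
                return float(parts[1].split()[0].strip("$.,"))
            except ValueError:
                return None
    return None

def negotiate(item: str, buyer_budget: float, seller_floor: float,
              max_turns: int = 6) -> float | None:
    """Alternate buyer/seller turns; return the agreed price or None."""
    history: list[str] = []
    for turn in range(max_turns):
        role = "buyer" if turn % 2 == 0 else "seller"
        limit = (f"Never agree to pay more than {buyer_budget}."
                 if role == "buyer"
                 else f"Never accept less than {seller_floor}.")
        msg = call_llm(
            f"You are the {role} agent negotiating for {item}. {limit}\n"
            "Conversation so far:\n" + "\n".join(history) +
            "\nReply with 'OFFER <price>' or, to close the deal, 'ACCEPT'."
        )
        history.append(f"{role}: {msg}")
        if "ACCEPT" in msg.upper():
            price = parse_last_offer(history[:-1])
            # Guard: an LLM agent may "accept" a deal that violates its own
            # constraint, i.e., the overspending anomaly the paper observes.
            if price is not None and role == "buyer" and price > buyer_budget:
                print("warning: buyer agent exceeded its budget")
            return price
    return None
```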