A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains

Xianren Zhang; Shreyas Prasad; Di Wang; Qiuhai Zeng; Suhang Wang; Wenbo Yan; Mat Hans

Abstract

Web agents have shown great promise in performing many tasks on e-commerce websites. To assess their capabilities, several benchmarks have been introduced. However, current benchmarks in the e-commerce domain face two major problems. First, they primarily focus on product search tasks (e.g., "Find an Apple Watch"), failing to capture the broader range of functionalities offered by real-world e-commerce services such as Amazon, including account management and gift card operations. Second, existing benchmarks typically evaluate whether the agent completes the user query, but ignore the potential risks involved. In practice, web agents can make unintended changes that negatively impact the user’s account or status. For instance, an agent might purchase the wrong item, delete a saved address, or incorrectly configure an auto-reload setting. To address these gaps, we propose a new benchmark called Amazon-Bench. To generate user queries that cover a broad range of tasks, we propose a data generation pipeline that leverages webpage content and interactive elements (e.g., buttons, check boxes) to create diverse, functionality-grounded user queries covering tasks such as address management, wishlist management, and brand store following. To enhance agent evaluation, we propose an automated evaluation framework that assesses both the performance and safety of web agents. We systematically evaluate various agents, finding that current agents struggle with complex queries and pose safety risks. These results highlight the need for developing more robust and reliable web agents.

Anthology ID: 2026.acl-long.68
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 1512–1528
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.68/
Cite (ACL): Xianren Zhang, Shreyas Prasad, Di Wang, Qiuhai Zeng, Suhang Wang, Wenbo Yan, and Mat Hans. 2026. A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1512–1528, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains (Zhang et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.68.pdf
Checklist: 2026.acl-long.68.checklist.pdf
