Tracy Ortman


2026

Large language models (LLMs) are increasingly deployed in monetization-driven systems such as search engines, advertising platforms, and e-commerce services, where decision making is shaped by complex interactions among user intent, advertiser objectives, and platform constraints. Despite rapid progress, existing benchmarks focus primarily on shopping-centric scenarios and user-facing data. They capture only a limited subset of real-world monetization pipelines and overlook intermediate decision stages and robustness considerations. In this work, we introduce MonBench, a high-quality multi-task benchmark designed to evaluate LLMs in realistic monetization contexts. The benchmark is constructed from large-scale production data collected from multiple search engines and includes both intermediate candidate pools and user-visible outcomes, so that it better reflects the distributional characteristics of real monetization systems. MonBench covers key capability dimensions such as intent understanding, commercial matching, and user behavior modeling, and adopts a unified multiple-choice formulation to enable systematic comparison across models. We further propose an evaluation protocol that measures both performance and robustness. We evaluate a diverse set of state-of-the-art LLMs and conduct detailed task-level analyses. Our results reveal monetization-specific behaviors, including gaps between relevance optimization and broader decision-making capabilities, as well as differences in robustness across model families. These findings provide new insights into the strengths and limitations of current LLMs and highlight the need for richer domain-specific supervision in monetization-oriented applications.