Hua Li


2026

Large language models (LLMs) are increasingly deployed in monetization-driven systems such as search engines, advertising platforms, and e-commerce services, where decision making is shaped by complex interactions among user intent, advertiser objectives, and platform constraints. Despite rapid progress, existing benchmarks primarily focus on shopping-centric scenarios and user-facing data, capturing only a limited subset of real-world monetization pipelines and overlooking intermediate decision stages and robustness considerations. In this work, we introduce MonBench, a high-quality multi-task benchmark designed to evaluate LLMs in realistic monetization contexts. The benchmark is constructed from large-scale production data collected from multiple search engines, including both intermediate candidate pools and user-visible outcomes, better reflecting the distributional characteristics of real monetization systems. MonBench covers key capability dimensions such as intent understanding, commercial matching, and user behavior modeling, and adopts a unified multiple-choice formulation to enable systematic comparison across models. We further propose a comprehensive evaluation protocol that measures both performance and robustness. We evaluate a diverse set of state-of-the-art LLMs and conduct detailed task-level analyses. Our results reveal monetization-specific behaviors, including gaps between relevance optimization and broader decision-making capabilities, as well as differences in robustness across model families. These findings provide new insights into the strengths and limitations of current LLMs and highlight the need for richer domain-specific supervision in monetization-oriented applications.
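
To make the idea of scoring both performance and robustness on multiple-choice items concrete, here is a minimal sketch. The abstract does not specify MonBench's protocol, so everything below is an assumption made for illustration: the item layout, the `Model` callable, and the choice of cyclic option rotation as the perturbation. It reports plain accuracy alongside the fraction of items a model answers correctly under every option ordering.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical item layout: (question, options, index of the correct option).
Item = Tuple[str, List[str], int]
# Hypothetical model interface: returns the index of the option it selects.
Model = Callable[[str, Sequence[str]], int]


def rotations(options: List[str], answer_idx: int) -> List[Tuple[List[str], int]]:
    """Return every non-trivial cyclic rotation of the options,
    together with the new position of the correct answer."""
    n = len(options)
    out = []
    for k in range(1, n):
        rotated = options[k:] + options[:k]
        out.append((rotated, (answer_idx - k) % n))
    return out


def evaluate(model: Model, items: List[Item]) -> Tuple[float, float]:
    """Return (accuracy, consistency).

    accuracy:    fraction of items answered correctly in the original order.
    consistency: fraction of items answered correctly in the original order
                 AND under every rotation of the options (an assumed
                 robustness measure, not one taken from the paper).
    """
    correct = 0
    consistent = 0
    for question, options, answer_idx in items:
        base_ok = model(question, options) == answer_idx
        if base_ok:
            correct += 1
        rotated_ok = all(
            model(question, opts) == idx for opts, idx in rotations(options, answer_idx)
        )
        if base_ok and rotated_ok:
            consistent += 1
    n = len(items)
    return correct / n, consistent / n
```

A model that always selects the first option can score well on accuracy when correct answers cluster in one position, yet its consistency drops to zero once the options are rotated, which is the kind of gap a robustness measure is meant to expose.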

2025

Given a task in the form of a basic description and its training examples, prompt optimization is the problem of synthesizing the given information into a text prompt for a large language model. Humans solve this problem by also considering the different facets that define a task (e.g., counter-examples, explanations, analogies) and including them in the prompt. However, it is unclear whether existing algorithmic approaches, based on iteratively editing a given prompt or automatically selecting a few in-context examples, can cover the multiple facets required to solve a complex task. In this work, we view prompt optimization as the problem of learning multiple facets of a task from a set of training examples. We exploit structure in the prompt optimization problem and break down a prompt into loosely coupled semantic sections. The proposed algorithm, UniPrompt, (1) clusters the input space and uses clustered batches so that each batch likely corresponds to a different facet of the task, and (2) utilizes a feedback mechanism to propose adding, editing, or deleting a section, which in turn is aggregated over a batch to capture generalizable facets. Empirical evaluation on multiple datasets and a real-world task shows that prompts generated using UniPrompt obtain higher accuracy than human-tuned prompts and those from state-of-the-art methods. In particular, our algorithm can generate long, complex prompts that existing methods are unable to generate.
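
The section-level editing step described above can be sketched as follows. This is a simplified illustration, not the UniPrompt implementation: the `Prompt` and `Edit` types, the function names, and the `min_support` threshold are hypothetical, the per-example edit proposals would come from an LLM feedback call that is not shown, and exact-match counting stands in for whatever aggregation the paper actually uses. Input clustering is omitted entirely.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical representation: a prompt is a set of loosely coupled sections.
Prompt = Dict[str, str]      # section title -> section text
# Hypothetical edit format: (action, section title, new text),
# where action is "add", "edit", or "delete".
Edit = Tuple[str, str, str]


def aggregate_edits(proposals: List[Edit], min_support: int) -> List[Edit]:
    """Keep only edits proposed for at least `min_support` examples in a batch.

    Assumption: identical edits are counted by exact match. The intent is to
    retain edits that generalize across the batch rather than fit one example.
    """
    counts = Counter(proposals)
    return [edit for edit, count in counts.items() if count >= min_support]


def apply_edits(prompt: Prompt, edits: List[Edit]) -> Prompt:
    """Return a new prompt with the section-level edits applied."""
    updated = dict(prompt)  # leave the input prompt untouched
    for action, title, text in edits:
        if action == "delete":
            updated.pop(title, None)
        elif action == "add" and title not in updated:
            updated[title] = text
        elif action == "edit" and title in updated:
            updated[title] = text
    return updated


def render(prompt: Prompt) -> str:
    """Flatten the sections into a single text prompt."""
    return "\n\n".join(f"{title}:\n{text}" for title, text in prompt.items())
```

Because each edit touches a single section, an edit that helps one cluster of inputs can be applied without rewriting the rest of the prompt, which is what makes long multi-section prompts tractable in this framing.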

2022

In a leading e-commerce business, we receive hundreds of millions of pieces of customer feedback through text communication channels such as product reviews. The feedback can contain rich information about customers’ dissatisfaction with the quality of goods and services. To harness this information and better serve customers, we developed a machine learning approach that automatically identifies product issues and uncovers root causes from customer feedback text. We identify issues at two levels: coarse-grained (L-Coarse) and fine-grained (L-Granular). We formulate this multi-level product issue identification problem as a seq2seq language generation problem. Specifically, we use transformer-based seq2seq models for their versatility and strong transfer-learning capability. We demonstrate that our approach is label-efficient and outperforms traditional approaches such as a multi-class, multi-label classification formulation. Based on human evaluation, our fine-tuned model achieves 82.1% and 95.4% human-level performance for L-Coarse and L-Granular issue identification, respectively. Furthermore, our experiments illustrate that the model can generalize to identify unseen L-Granular issues.
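
One way to picture the seq2seq formulation is as a mapping from feedback text to a single target string that encodes both label levels. The sketch below shows such a serialization and its inverse. The target format, the separators, the function names, and the example labels are all hypothetical, since the abstract does not describe the actual format used.

```python
from typing import Dict, List


def encode_target(issues: Dict[str, List[str]]) -> str:
    """Serialize {coarse issue: [granular issues]} into one target string.

    Assumed format: "coarse: granular, granular ; coarse: granular".
    Labels are assumed not to contain ':', ',' or ';'.
    """
    parts = [f"{coarse}: {', '.join(granular)}" for coarse, granular in issues.items()]
    return " ; ".join(parts)


def decode_target(text: str) -> Dict[str, List[str]]:
    """Parse a generated target string back into the two-level structure.

    Segments without a ':' are skipped, so malformed generations do not raise.
    """
    issues: Dict[str, List[str]] = {}
    for part in text.split(";"):
        if ":" not in part:
            continue
        coarse, _, granular = part.partition(":")
        issues[coarse.strip()] = [g.strip() for g in granular.split(",") if g.strip()]
    return issues
```

Because the output is free text rather than an index into a fixed label set, a decoder of this kind can return a granular issue that never appeared as a training label, which is consistent with the generalization to unseen L-Granular issues reported above.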