Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)

Anthology ID:: 2026.acl-industry
Month:: July
Year:: 2026
Address:: San Diego, California, USA
Venue:: ACL
Event:: Annual Meeting of the Association for Computational Linguistics (2026)
SIG:
Publisher:: Association for Computational Linguistics
URL:: https://preview.aclanthology.org/ingest-acl/2026.acl-industry/
DOI:
ISBN:: 979-8-89176-394-4
Bib Export formats:: BibTeX

Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Yunyao Li | Georg Rehm | Mei Tu

pdf bib

ACL 2026 Industry Track: Overview
Yunyao Li | Georg Rehm | Mei Tu

pdf bib abs

Smarter, not Bigger: Fine-Tuned RAG-Enhanced LLMs for Automotive Hardware-in-the-Loop Testing
Chao Feng | Zihan Liu | Siddhant Gupta | Jan von der Assen

Hardware-in-the-Loop (HIL) testing is essential for automotive validation but suffers from fragmented and underutilized test artifacts. This paper presents HIL-GPT, an industry-deployed retrieval-augmented generation (RAG) system that integrates semantic retrieval with domain-adapted large language models to support test engineers in real-world HIL workflows. The system combines domain-specific embeddings to enable traceable retrieval of test cases and requirements under industrial latency and cost constraints. Through empirical evaluation, we show that compact, domain-adapted models can achieve a favorable trade-off among accuracy, latency, and cost compared to larger general-purpose models, challenging the assumption that larger models are always preferable in industrial NLP systems. An A/B user study further confirms that HIL-GPT improves perceived helpfulness, truthfulness, and satisfaction over general-purpose LLMs.

pdf bib abs

We introduce MinerU2.5, a 1.2B-parameter document parsing vision-language model that achieves state-of-the-art recognition accuracy while maintaining exceptional computational efficiency. Our approach employs a coarse-to-fine, two-stage parsing strategy that decouples global layout analysis from local content recognition. In the first stage, the model performs efficient layout analysis on downsampled images to identify structural elements, circumventing the computational overhead of processing high-resolution inputs. In the second stage, guided by the global layout, it performs targeted content recognition on native-resolution crops extracted from the original image, preserving fine-grained details in dense text, complex formulas, and tables. To support this strategy, we developed a comprehensive data engine that generates diverse, large-scale training corpora for both pretraining and fine-tuning. Ultimately, MinerU2.5 demonstrates strong document parsing ability, achieving state-of-the-art performance on multiple benchmarks, surpassing both general-purpose and domain-specific models across various recognition tasks, while maintaining significantly lower computational overhead.

pdf bib abs

Leveraging Generative AI for Extracting Business Requirements from Legacy COBOL and PL/I Code
Ankur Kalohia

Recovering business requirements fromCOBOL and PL/I portfolios is difficult becauselogic is scattered across interdependentprograms and data definitions, and existinganalyses seldom yield stakeholder-facingartifacts. We introduce an LLM-augmentedreverse-engineering pipeline that providesdeterministic parsing, schema-constrainedLLM generation with bidirectional traceabilityto code. It couples grammar-based parsingand control-flow and data-flow analysis with alarge language model to translate an enrichedintermediate representation into structuredspecifications. This is not raw-code promptingor generic summarization, the novelty is theLLM-centered generation over an enriched IR,with structured JSON outputs and traceabilityfor compliance-sensitive settings. The pipelineproduces business requirements documents,explicit rule catalogs, end-to-end data lineage,create–read–update–delete matrices, and field-level source-to-target mappings, each linkedto the supporting code. In a financial industrysetting, containing 3.4M+ LoC includingcomments / 3.2M excluding comments ofCOBOL, the system achieves 93% agreementwith expert-authored business rules andreduces documentation effort by approximately70%, as measured against manually producedrequirement documents and rule sets. On theinternal corpus spanning 3.4M lines acrossonline, batch, and job control workloads, theapproach yields approximately 3.2–3.3× fasteranalysis while improving artifact consistencyand traceability.

pdf bib abs

Is Agentic RAG worth it? An experimental comparison of RAG approaches
Pietro Ferrazzi | Milica Cvjetićanin | Alessio Piraccini | Davide Giannuzzi

Retrieval-Augmented Generation (RAG) systems are usually defined by the combination of a generator and a retrieval component that extracts textual context from a knowledge base to answer user queries. However, such basic implementations exhibit several limitations, including noisy or suboptimal retrieval, misuse of retrieval for out-of-scope queries, weak query–document matching, and variability or cost associated with the generator. These shortcomings have motivated the development of "Enhanced" RAG, where dedicated modules are introduced to address specific weaknesses in the workflow.More recently, the growing self-reflective capabilities of Large Language Models (LLMs) have enabled a new paradigm, often referred to as "Agentic" RAG. In this approach, an LLM orchestrates the entire process, deciding which actions to perform, when to perform them, and whether to iterate. Despite the rapid adoption of both paradigms, it remains unclear which approach is preferable under which conditions.In this work, we conduct an empirically driven evaluation of "Enhanced" and "Agentic" RAG across multiple scenarios and dimensions. Our results provide practical insights into the trade-offs between the two paradigms, offering guidance on selecting the most effective RAG design for real-world applications, considering both performance and costs.

pdf bib abs

Improving Hate Speech Detection by Fusing Textual and User Interaction Representations in Online Communities
Xu Gao | Dong Jing | Kee-hung Lai

Detecting hate speech in online communities is increasingly challenging due to the implicit and context-dependent nature of toxic expressions. While text-only models often struggle with such ambiguity, incorporating user interaction signals offers critical pragmatic context for disambiguation. However, research in this direction is hindered by the scarcity of datasets that align textual content with comprehensive user behavioral graphs. To bridge this gap, we present a new dataset collected from a real-world community, featuring labeled hate speech enriched with fine-grained interaction histories. We further propose a novel user-aware hate speech detection framework that effectively fuses textual semantics with social interaction representations. Experiments demonstrate that our approach consistently outperforms strong text-only baselines by over 3.6%, validating the critical role of social context in enhancing detection accuracy. Furthermore, to mitigate real-world adversarial risks such as graph spoofing and spam, we introduce a contrastive graph augmentation strategy, ensuring model robustness against unreliable community behaviors.

pdf bib abs

The scalability of high-quality online education is hindered by the high costs and slow cycles of manual content creation.Despite advancements in video generation, current approaches often fail to ensure pedagogical structure and precise control due to their pixel-level, black-box nature.In this paper, we propose Generative Teaching, a novel paradigm shifting educators from manual creators to high-level directors who focus on pedagogical intents while agents handle the execution. To realize this vision, we introduce TeachMaster, a multi-agent framework that leverages code as an intermediate semantic medium. Unlike traditional video generation methods, TeachMaster orchestrates a collaborative team of agents, spanning planning, design, and rendering, to automate the production of interpretable, editable, and curriculum-ready educational videos. Experiments validate that TeachMaster significantly boosts production efficiency without compromising structural coherence or visual fidelity, slashing production costs to only 0.3% of traditional online course videos and providing a robust solution for scalable education.

pdf bib abs

Online advertising governance faces significant challenges due to the non-stationary nature of regulatory policies, where emerging mandates (e.g., restrictions on education or aesthetic anxiety) create severe label inconsistencies and reasoning ambiguities in historical datasets. In this paper, we propose ARGUS, a policy-adaptive governance system that enables evolving reinforcement through multi-agent adversarial umpiring. ARGUS addresses the sparsity of new policy data by employing a three-stage framework: (1) Policy Seeding for initial perception; (2) Adversarial Label Rectification, which utilizes a ”Prosecutor-Defender-Umpire” architecture to resolve conflicts between stale labels and new mandates; and (3) Latent Knowledge Discovery, which employs a tripartite dialectical discussion to unearth sophisticated, “gray-area” violations. By leveraging RAG-enhanced policy knowledge and Chain-of-Thought synthesis as dynamic rewards for reinforcement learning, ARGUS synchronizes its reasoning pathways with evolving regulations. Extensive experiments on both industrial and public datasets demonstrate that ARGUS significantly outperforms traditional fine-tuning baselines, achieving superior policy-adaptive learning with minimal gold data.

pdf bib abs

Synthetic Text Detection in the Age of Large Language Models: Watermark vs. Automatic Detection
Adaku Uchendu

Given the ubiquitous nature of Large Language Models (LLMs) and its impressive capabilities, malicious uses of this technology to generate harmful content have been observed. Thus, to mitigate this serious security risk LLMs pose, many researchers have proposed two techniques for detecting synthetic texts generated from LLMs - watermark and automatic detection. The idea with watermarking LLMs involves infusing generated content with algorithmically-identifiable patterns during generation. This makes accurate synthetic text detection achievable with watermark detection. While, for automatic detection, the focus is on using statistical and linguistic cues to reveal authorship of texts as human or LLM. Currently, both types of synthetic text detectors achieve state-of-the-art performance, however, the better detector is still unknown. To ascertain the better detection method, we evaluate each method on their performance on both unperturbed and perturbed (i.e., adversarially manipulated texts) data. We perform a comprehensive study across six different sizes of Qwen2.5 models, six watermark techniques and detectors, two automatic detectors, three authorship obfuscation methods for different levels of syntactic changes, and two datasets of different text lengths. Our results suggest that there is no detector that consistently outperforms on all scenarios. However, we observe that the (1) automatic detectors are better for short synthetic text detection; and (2) watermark detectors perform better defending against the word-level attack implemented.

pdf bib abs

Retrieval-augmented generation (RAG) plays a critical role in user-generated content (UGC) platforms, but its effectiveness critically depends on accurate query–document relevance assessment. Despite recent advances in applying large language models (LLMs) to relevance modeling, UGC platforms present unique challenges: 1) ambiguous user intent due to sparse user feedback in RAG scenarios, and 2) asymmetric relevance, where relevance is driven by localized answer-bearing content rather than global query–document similarity. To address these issues, we propose the Reinforced Reasoning model for Relevance Assessment (R³A), which decomposes relevance assessment into intent inference and evidence grounding. R³A leverages auxiliary high-clicked documents to infer latent query intent, and extracts verbatim evidence fragments to ground relevance decisions, reducing noise sensitivity and improving asymmetric relevance modeling. Experimental results demonstrate that R³A substantially outperforms strong baselines on offline benchmarks, while the distilled R³A-1.5B model achieves significant gains in large-scale online A/B testing, effectively balancing performance and practical deployability.

pdf bib abs

The increasing reliance on natural language generation (NLG) models, particularly large language models, has raised concerns about the reliability and accuracy of their outputs. A key challenge is hallucination, where models produce plausible but incorrect information. As a result, hallucination detection has become a critical task. In this work, we introduce a comprehensive hallucination taxonomy with 11 categories across various NLG tasks and propose the HAllucination Detection (HAD) models, which integrate hallucination detection, span-level identification, and correction into a single inference process. Trained on an elaborate synthetic dataset of about 90K samples, our HAD models are versatile and can be applied to various NLG tasks. We also carefully annotate a test set for hallucination detection, called HADTest, which contains 2,248 samples. Evaluations on in-domain and out-of-domain test sets show that our HAD models generally outperform the existing baselines, achieving state-of-the-art results on HaluEval, FactCHD, and FaithBench, confirming their robustness and versatility.

pdf bib abs

Document understanding in real-world applications often requires processing heterogeneous, multi-page document packets containing multiple documents stitched together. Despite recent advances in visual document understanding, the fundamental task of document packet splitting, which involves separating a document packet into individual units, remains largely unaddressed. We present the first comprehensive benchmark dataset, DocSplit, along with novel evaluation metrics for assessing the document packet splitting capabilities of large language models. DocSplit comprises five datasets of varying complexity, covering diverse document types, layouts, and multimodal settings. We formalize the DocSplit task, which requires models to identify document boundaries, classify document types, and maintain correct page ordering within a document packet. The benchmark addresses real-world challenges, including out-of-order pages, interleaved documents, and documents lacking clear demarcations. We conduct extensive experiments evaluating multimodal LLMs on our datasets, revealing significant performance gaps in current models’ ability to handle complex document splitting tasks. The DocSplit benchmark datasets and proposed novel evaluation metrics provide a systematic framework for advancing document understanding capabilities essential for legal, financial, healthcare, and other document-intensive domains. We release the datasets and evaluation code to facilitate future research in document packet processing.

pdf bib abs

TaoType: Predicting Fine-Grained Typing Intent for Faster Search
Yipeng Yu | Yichen Yuan | Chengxiao Feng | Xu Liu

"Is the user’s current query input exactly what they intend to search for?" Our work aims to answer this question by determining, at each typing, whether the current query is complete. If so, a search is implicitly triggered in advance without waiting for user confirmation. This approach reduces response time and enhances the user search experience. Specifically, we propose TaoType, a client-side framework that introduces innovation in data sampling, feature selection, model design and training, and online strategy. Experiments in a leading mobile shopping application named Taobao validate its effectiveness, achieving offline precision/recall/accuracy of 0.7936/0.8196/0.7742, respectively, and decreasing online response time by 640.51±93.65 milliseconds, which is of great benefit to the search system. Unlike prior work that focuses on optimizing server-side engineering pipelines or simplifying ranking models, our method leverages client-side typing behavior for real-time early prediction, utilizing on-device computation to gain response time reducing. To the best of our knowledge, our work is the first to identify and address this problem. This work also introduces App Intelligence, a new paradigm for enhancing mobile applications by integrating on-device AI to boost business value and user experience.

pdf bib abs

As LLM-based judges become integral to industry applications, obtaining well-calibrated uncertainty estimates efficiently has become critical for production deployment. However, existing techniques, such as verbalized confidence and multi-generation methods, are often either poorly calibrated or computationally expensive. We introduce linear probes trained with a Brier score-based loss to provide calibrated uncertainty estimates from reasoning judges’ hidden states, requiring no additional model training. We evaluate our approach on both objective tasks (reasoning, mathematics, factuality, coding) and subjective human preference judgments. Our results demonstrate that probes achieve superior calibration compared to existing methods with x computational savings, generalize robustly to unseen evaluation domains, and deliver higher accuracy on high-confidence predictions. However, probes produce conservative estimates that underperform on easier datasets but may benefit safety-critical deployments prioritizing low false-positive rates. Overall, our work demonstrates that interpretability-based uncertainty estimation provides a practical and scalable plug-and-play solution for LLM judges in production.

Generative AI has enabled the creation of photorealistic images and videos that are increasingly disseminated on social media, often used for spam, misinformation, manipulation, and fraud. Existing AI-generated content (AIGC) detection methods face challenges including poor generalization to new generation models, reliance on single modalities, and lack of interpretable explanations. We present our pipeline that mitigates these issues by continuously curating diverse multi-modal social media data and training a compact vision-language model for detection and explanation. Our model achieves state-of-the-art detection performance on public benchmarks and demonstrates robust detection and explanation capabilities on internal social media datasets across multiple platforms. We deployed our model for post recommendation on social media platforms and observed positive downstream impacts on user engagement, demonstrating that it is feasible to perform effective AIGC detection in dynamic, real-world social media environments.

pdf bib abs

Conversational agents, such as ChatGPT and Doubao, have become essential daily assistants for billions of users. To further enhance engagement, these systems are evolving from passive responders to proactive companions. However, existing efforts focus on activation within ongoing dialogues, while overlooking a key real-world bottleneck. In the conversation initiation stage, users may have a vague need but no explicit query intent, creating a first-message barrier where the conversation holds before it begins. To overcome this, we introduce Conversation Starter Generation: generating personalized starters to guide users into conversation. However, unlike in-conversation stages where immediate context guides the response, initiation must operate in a cold-start moment without explicit user intent. To pioneer in this direction, we present IceBreaker that frames human ice-breaking as a two-step handshake: (i) evoke resonance via Resonance-Aware Interest Distillation from session summaries to capture trigger interests, and (ii) stimulate interaction via Interaction-Oriented Starter Generation, optimized with personalized preference alignment and a self-reinforced loop to maximize engagement. Online A/B tests on one of the world’s largest conversational agent products show that IceBreaker improves user active days by +1.84‰ and click-through rate by +94.25‰, and has been deployed in production.

pdf bib abs

The demand for efficient large language model inference has spurred interest in sparsification, yet current hardware support remains narrowly focused on 2:4 weight sparsity. In this work, we argue that activation sparsity despite being overlooked in hardware design offers a promising path for dynamic, input-adaptive compression with significant I/O and memory benefits. We present a comprehensive post-training study of N:M activation pruning across four LLMs (Llama2-7B-chat, Llama3.1-8B-Instruct, Qwen2.5-7B-Instruct, Gemma3-4B-Instruct), demonstrating that activation pruning consistently outperforms weight pruning at matched sparsity levels. We evaluate lightweight, plug-and-play error mitigation and selection strategies that require minimal or no calibration data across four sparsity patterns: 2:4, 4:8, 8:16, and 16:32. Among these, 16:32 approaches the performance of unstructured 50% sparsity and is is approximately 2.7× better than 2:4, while 8:16 offers an optimal balance of accuracy and practicality. Our results provide evidence that next-generation accelerators should consider native support for N:M activation sparsity and can serve as a strong baseline for the future methods.

pdf bib abs

As the paradigm of AI shifts from text-based LLMs to Speech Language Models (SLMs), there is a growing demand for full-duplex systems capable of real-time, natural human-computer interaction.However, the development of such models is constrained by the scarcity of high-quality, multi-speaker conversational data, as existing large-scale resources are predominantly single-speaker or limited in volume.Addressing the complex dynamics of natural dialogue, such as overlapping and back-channeling remains a challenge, with standard processing pipelines suffering from diarization errors and ASR hallucinations.To bridge this gap, we present a robust and scalable open-source data processing pipeline designed for full-duplex model.Our code and project page are publicly available at https://anonymous-2001-j.github.io/sommelier.github.io/.

pdf bib abs

Multi-turn LLM agents are becoming pivotal to production systems, spanning customer service automation, e-commerce assistance, and interactive task management, where accurately distinguishing high-value informative signals from stochastic noise is critical for sample-efficient training. In real-world scenarios, a failure in a trivial task may reflect random instability, whereas success in a high-difficulty task signifies a genuine capability breakthrough. Yet, existing group-based policy optimization methods rigidly rely on statistical deviation within discrete batches, frequently misallocating credit when task difficulty fluctuates. To address this issue, we propose Proximity-based Multi-turn Optimization (ProxMO), a practical and robust framework engineered specifically for the constraints of real-world deployment. ProxMO integrates global context via two lightweight mechanisms: success-rate-aware modulation dynamically adapts gradient intensity based on episode-level difficulty, while proximity-based soft aggregation derives baselines through continuous semantic weighting at the step level. Extensive evaluations on ALFWorld and WebShop benchmarks demonstrate that ProxMO yields substantial performance gains over existing baselines with negligible computational cost. Ablation studies further validate the independent and synergistic efficacy of both mechanisms. Crucially, ProxMO offers plug-and-play compatibility with standard GRPO frameworks, facilitating immediate, low-friction adoption in existing industrial training pipelines. Our implementation is available at: https://github.com/GithubX-F/ProxMO-RL.

pdf bib abs

Large Language Models (LLMs) based on Mixture-of-Experts (MoE) are pivotal in industrial applications for their ability to scale performance efficiently. However, standard MoEs enforce uniform expert sizes, creating a rigidity that fails to align computational costs with varying token-level complexity. While heterogeneous expert architectures attempt to address this by diversifying expert sizes, they often suffer from significant system-level challenges, specifically unbalanced GPU utilization and inefficient parameter utilization, which hinder practical deployment.To bridge the gap between theoretical heterogeneity and robust industrial application, we propose Mixture of Heterogeneous Grouped Experts (MoHGE) which introduces a two-level routing mechanism to enable flexible, resource-aware expert combinations. To optimize inference efficiency, we propose a Group-Wise Auxiliary Loss, which dynamically steers tokens to the most parameter-efficient expert groups based on task difficulty.To address the critical deployment challenge of GPU load balancing, we introduce an All-size Group-decoupling Allocation strategy coupled with an Intra-Group Experts Auxiliary Loss. These mechanisms collectively ensure uniform computation distribution across GPUs.Extensive evaluations demonstrate that MoHGE matches the performance of MoE architectures while reducing the total parameters by approximately 20% and maintaining balanced GPU utilization. Our work establishes a scalable paradigm for resource-efficient MoE design, offering a practical solution for optimizing inference costs in real-world scenarios.

pdf bib abs

Policy Compliance of User Requests in Natural Language for AI Systems
Pedro Cisneros-Velarde

Consider an organization whose users send requests in natural language to an AI system that fulfills them by carrying out specific tasks. In this paper, we consider the problem of ensuring such user requests comply with a list of diverse policies determined by the organization with the purpose of guaranteeing the safe and reliable use of the AI system. We propose, to the best of our knowledge, the first benchmark consisting of annotated user requests of diverse compliance with respect to a list of policies. Our benchmark is related to industrial applications in the technology sector. We then use our benchmark to evaluate the performance of various LLM models on policy compliance assessment under different solution methods. We analyze the differences on performance metrics across the models and solution methods, showcasing the challenging nature of our problem.

pdf bib abs

Better Literary Translation: A Multi-Aspect Data Generation and LLM Training Approach
Zhihao Lin | Ziqi Zhu | Hao Huang | Guanghui Wang | Peiyang He

Literary translation poses unique challenges due to the scarcity of high-quality annotated data and the need to balance expression fluency with literary effect. We present a multi-aspect iterative refinement framework that generates high-quality translation references and preference data through specialized LLM translators, each targeting a distinct quality dimension. We leverage the generated data for supervised fine-tuning and reinforcement learning. Experiments show that our generated references outperform the original ground truth for SFT by 8.65 CEA100 points. For reinforcement learning, we find that DPO leads to performance degradation in this setting, while leveraging an explicit reward model for GRPO yields an additional 1.51 point improvement. We attribute this to the stability of two-stage training and GRPO’s online exploration capability. Our resulting models, LitMT-8B and LitMT-14B, achieve 67.25 and 69.07 CEA100 respectively on the MetaphorTrans English-to-Chinese literary translation benchmark, competitive with Claude Sonnet 4.5 at 68.43, and demonstrate strong generalization to out-of-domain literary work (i.e., O. Henry).

pdf bib abs

While advertising is a cornerstone of commercial growth, it is constrained by online violation detection systems that reject non-compliant content at a million-scale daily. Advertisers urgently require automated solutions to rectify these advertisements, especially visual ads, as manual fixing is unscalable. Although recent safety-driven methods can achieve compliance, they typically suffer from over-editing, destroying the original commercial intent and perceptual similarity.To address this, we present SSR-A, a framework tailored for the minimalist rectification of non-compliant image ads.Instead of fine-tuning image editing models directly, SSR-A focuses on translating violation policies into targeted editing instructions.We first introduce a Spatial- and Semantic-Aware Instruction Synthesis Pipeline, where MLLMs synthesize candidate instructions—incorporating spatial grounding and semantic guidance—and select the optimal instruction via multi-dimensional evaluation. Furthermore, we align the model using Curriculum Reinforcement Learning, employing GRPO with multi-faceted rewards to progressively navigate the trade-off between compliance and visual preservation. Extensive experiments and online A/B tests show that SSR-A significantly outperforms state-of-the-art baselines in both compliance and preservation of visual and commercial consistency.

pdf bib abs

Rigorous content moderation is crucial for online advertising but leads to millions of daily rejections. This scale renders manual rectification infeasible, particularly for video advertisements.However, existing safety-driven methods often suffer from aggressive over-editing, which compromises the advertiser’s original semantic intent merely to satisfy compliance.In this work, we target the rectification of textual violations in video ads, covering both speech transcripts and on-screen text. We propose ℛ³, a novel framework designed to harmonize compliance with original semantic intent preservation.Our approach integrates three key innovations: (1) an experience-driven data synthesis framework that bootstraps high-quality supervision via group-**R**elative compliance experience extractor; (2) a curriculum **R**einforcement learning strategy with hierarchical rewards designed to enforce compliance while maximizing semantic consistency;and (3) a comprehensive video **R**ectification framework seamlessly integrating text recognition, rewriting, and re-rendering for industrial deployment. Extensive experiments on industrial datasets and online A/B testing demonstrate that ℛ³ significantly outperforms state-of-the-art baselines, achieving an optimal trade-off between violation rectification and intent preservation.

pdf bib abs

The rapid growth of online grocery shopping requires recommendation systems that capture cyclical purchasing behavior and diverse user intents. Traditional item-level methods face scalability and accuracy challenges, motivating category-level recommendation as a more structured and practical alternative. We present GrocLM, a fine-tuned language model for grocery category recommendation in a real-world production environment. GrocLM employs a two-stage LoRA-based training strategy to encode cyclical purchasing patterns directly into model parameters, enabling more effective utilization of rebuying signals compared to prompt-based conditioning. To ensure valid and controllable outputs, we further introduce a trie-based constrained decoding mechanism over a predefined category space. Experiments on both proprietary production data and a public benchmark demonstrate that GrocLM consistently outperforms strong baselines. In a live production restocking task, GrocLM achieves a 7.5% relative improvement in cart-adds per impression while maintaining efficient inference by generating all categories jointly. These results highlight the effectiveness and practicality of integrating large language models into structured recommendation systems.

pdf bib abs

DeepResearch Retail: Benchmarking Tool-Augmented Deep Research in the E-Commerce Domain
Rafael Ferreira | Flavio Di Palo | Huilin Lu | Ayush Jain | Harsha Aduri

Deep Research (DR) systems autonomously retrieve and synthesize information from web sources, however, industrial DR applications face a critical gap: effective integration of internal tools with web search. In this work, we introduce DeepResearch Retail, an evaluation framework grounded in real-world e-commerce data for assessing Deep Research with tools (DR+Tools) in realistic commercial settings. The framework evaluates both factual faithfulness and multidimensional response quality when reasoning over heterogeneous web and internal data sources.We further present Hybrid-ReAct, a multi-agent architecture that demonstrates how collaborative reasoning and tool use can produce evidence-grounded answers. Experimental results validate our framework’s utility, showing improvements in agent’s performance when leveraging web-page information and multi-agent specialization.

pdf bib abs

GeoGround: Uncertainty-Weighted Multi-Task Learning for Geo-Alignment and Address Defect Detection
Srinivas Virinchi | Aman Gulati | Anoop Saladi

Address intelligence in e-commerce demands accurate geocoding and proactive defect detection under strict sub-50 ms latency constraints. These tasks are inherently coupled: precise spatial grounding provides a strong prior for defect propensity, yet prior approaches optimize them independently. While generative LLMs offer rich semantic representations, they lack spatial inductive bias and fail to meet real-time serving requirements. We introduce GeoGround, a multi-task learning framework that jointly models coordinate grounding and address defect detection. The model combines a hierarchical spatial grounding objective with Focal Loss for defect classification, using uncertainty-based task weighting to balance optimization under severe class imbalance. To strengthen supervision, we curate a large-scale noisy address dataset using LLM-assisted data construction, augmenting the training corpus with signals that are costly to obtain manually. GeoGround achieves 5.86× gains in address defect detection precision and up to 4.86× improvements in location prediction accuracy over strong encoder baselines, while remaining 75× more efficient than decoder LLMs such as Qwen2-1.5B. A two-week online A/B test in a large-scale delivery pipeline confirms real-world impact, yielding a 50 bps uplift in defect detection, a 40 bps gain in location prediction, and an estimated operational savings of $3.09M annually.

pdf bib abs

Vision-Language Models (VLMs) perform well on general multimodal tasks, yet applying them to real-world advertisement (ad) evaluation is challenging due to strong brand specificity and limited labeled data. We introduce a new practical task, brand-specific ad ranking, which aims to rank ads for a target brand prior to deployment by modeling brand-specific effectiveness. To this end, we propose ADvisor, which derives explicit brand-aware decision criteria using VLMs, augments limited brand context with ads from similar brands, and applies reflection-based scoring for ranking. Experiments on real-world advertising data from 10 brands, collected through actual ad campaigns, show that ADvisor outperforms strong baselines by up to 7.2%. Further analyses show the generated criteria capture meaningful brand specificity, and ADvisor also performs strongly in online A/B testing. Our code is available at https://github.com/K-Kyungho/ADvisor.

pdf bib abs

While Large Language Models excel at reasoning and language understanding, they struggle with multi-step operational workflows requiring precise procedural adherence, which is fundamental for industrial automation. Existing SOP-guided agents assume well-defined procedures and structured APIs, failing to address enterprise realities like incomplete SOPs, dynamic web interfaces, and unpredictable document formats. We present Agent-Ops, an end-to-end multi-agent framework automating Standard Operating Procedures in e-commerce. Agent-Ops contributes: (1) SOP Groomer, a human-AI framework transforming ambiguous documentation into automation-ready specifications, improving accuracy by 13.2%, (2) WebAgent, achieving 91.3% task completion and 86.5% execution consistency through demonstration-based learning, and (3) a Document Verification Agent performing multi-lingual validation across tax invoices, certificates, and supply chain documents with 94.2% accuracy. Deployed across seven SOP categories in three geographic regions, Agent-Ops achieves 85-97% end-to-end accuracy while reducing case resolution from 30 to 5 minutes (83% reduction). Production deployment with over 1000 Account Managers validates that LLM-based agents achieve enterprise-grade reliability when augmented with robust web automation, comprehensive document understanding, and systematic SOP refinement.

pdf bib abs

Reason-Code: Reliable Code Generation via Test-Driven Monte Carlo Tree Search
Zixu Li | Zhiqi Peng

Large Language Models (LLMs) are widely used for code generation, but their performance degrades on tasks requiring multi-step logical reasoning. In practice, reliability is often improved through multi-sample inference, but its cost grows linearly with the sample size, making it impractical under strict latency constraints. To address this, we propose Reason-Code, an inference-time framework that formulates code generation as a search process guided by execution feedback. It integrates Monte Carlo Tree Search (MCTS) with a lightweight execution sandbox, where candidate programs are evaluated via unit tests. To control inference cost, Reason-Code adopts a conditional budgeting strategy that activates search only when greedy generation fails. Compared with large-sample Best-of-N sampling, Reason-Code is designed to improve reliability without paying the full linear cost of additional sampling under strict latency budgets. Experiments on HumanEval and MBPP show that Reason-Code matches strong sampling baselines (e.g., Best-of-10) with lower token cost and no regression. Additional matched-budget analyses show that execution-guided adaptive inference improves over independent sampling/filtering baselines, while differences between UCB-guided search and simpler iterative repair are limited at low budget.

pdf bib abs

Conversational AI is increasingly used at eBay to deliver personalized customer support. We present a production RAG-based How-To Assistant that answers support and how-to queries by grounding responses in a proprietary knowledge base. We study three factors that drive quality: (1) document chunking and contextualization for indexing, (2) query refinement methods, and (3) automatic LLM-based evaluation for rapid iteration and reliable measurement. We also describe the end-to-end system workflow - from offline indexing to real-time serving and report deployment metrics, offering practical guidance for building scalable, high-precision RAG assistants in commercial support settings.

pdf bib abs

Global-scale video moderation faces a dual challenge: the need for fine-grained multimodal reasoning and the demand for interpretable outputs to support downstream enforcement. Traditional moderation systems often rely on fragmented black-box classifiers that are difficult to maintain and lack transparency.In this paper, we present UNIVID, a Unified Vision-Language model for Video Moderation. Unlike standard classification models, UNIVID generates policy-aware captions that serve as an interpretable intermediate representation, enabling human-verifiable decisions and multi-task reusability. While existing open-source and commercial VLMs often suffer from safety-guardrail refusals and lack fine-grained policy alignment, we develop a specialized training data recipe that combines expert human-refined labels with synthetic data to align the model with our safety guidelines.By integrating UNIVID as the core captioner, we design a novel end-to-end video moderation system that reduces violation leakage by 42.7% and overkill rate by 37.0% relatively. Meanwhile, by replacing over 1,000 policy-specific models with a single UNIVID backbone, we recycle extensive computational resources while significantly reducing engineering maintenance overhead. To our knowledge, this is one of the first reports of a high-efficiency captioning VLM successfully supporting industrial-scale moderation and cross-functional business.

pdf bib abs

While coronary imaging is widely used for anatomical assessment, CCTA reports play a distinct last-mile role in clinical care. Ratherthan serving as an intermediate signal, CCTA provides an assessment of coronary disease severity (known as the CAD-RADS score) toguide patient management. However, real-world clinical text exhibits substantial heterogeneity in terminology and structure, leadingto inconsistent interpretation by automated systems, even for clinically similar cases. Recent work leverages a direct application ofLLMs for automated CAD-RADS scoring, but is limited by small, non-public, and homogeneous clinical data. We introduce CCTA-RADS, the largest publicly available dataset of 940 real-world CCTA reports from a major cardiovascular center, each annotated with CAD-RADS scores. Our analysis reveals that direct approaches, including state-of-the-art LLMs (GPT-4o, GPT-o3) and fine-tuned BERT models underperform on diverse real-world clinical data. To address these limitations, we propose a two-stage pipeline that decouples structuring from classification: an LLM-based parser normalizes heterogeneous reports into structured format, followed by fine-tuned BERT classification. This approach substantially improves the F1-score by 6%-13% compared with direct methods. We deploy our system as an interactive web interface that allows clinicians to upload CCTA reports for automated CAD-RADS assessment with SHAP and LIME explainability visualizations.

pdf bib abs

We deploy large language models (LLMs) as business development (BD) agents for persuasive price negotiation in online travel agencies (OTAs). The agent must follow a multi-stage Standard Operating Procedure (SOP) and strict guardrails (no over-promising and no hallucinations), while remaining human-like and effective over long, multi-turn dialogues.We propose Reward-Enhanced Policy Optimization (REPO), a reinforcement learning post-training method that combines heterogeneous rewards: a preference-trained reward model (RM), an LLM-as-a-judge (RJ) for nuanced behaviors (e.g., emotional value and SOP compliance), and rule-based reward functions (RF) (mainly regex-based) for deterministic checks on numerics, formatting, and guardrails. In expert consensus evaluation (three human experts; 30 online conversations and 45 curated bad cases), REPO improves average dialogue rating to 4.63 (+0.33 over GRPO) and raises the share of conversations with at least one excellent response to 66.67% (+23.34 pp over GRPO), while achieving a 93.33% bad-case fix rate with 75.56% clean fixes.In a production A/B test on 9,653 real customer conversations (vs. an intent-driven dialogue system), REPO improves response rate by +12.14 pp and task success rate by +5.94 pp (p<0.001).

pdf bib abs

NEST: Nested Evidence Survival for Retrieval
Akshay Verma | Siddharth Pillai | Prateek Sircar | Deepak Gupta

Retrieval-Augmented Generation (RAG) systems degrade sharply under extreme noise, where relevant evidence is sparse and easily pruned by static retrieval decisions. Existing approaches fixed top-k retrieval, hierarchical chunking, cross-encoder reranking, or policy-based iterative control- either rely on rigid heuristics or incur substantial computational overhead, and often fail to recover context-dependent evidence without introducing redundancy or latency. We introduce NEST (Nested Evidence Survival for Retrieval), a lightweight, training-free RAG framework that improves factual grounding by explicitly separating recall amplification from precision selection. NEST first maximizes recall through Nested Evidence Survival, evaluating candidates under nested retrieval contexts to rescue evidence that would otherwise be pruned by static chunking. It then applies a survival-consistent Mean Reciprocal Rank (MRR) selection mechanism to retain evidence that remains salient across retrieval scopes, removing redundancy without harming recall. Evaluated on WebQuestions, HotpotQA (distractor setting), and a proprietary InternalQA benchmark with 50M Common Crawl distractors, NEST consistently outperforms strong adaptive RAG baselines, including DeepRAG, improving EM by up to +2.4 pp and F1 by +2.1 pp, while increasing retrieval recall by +6.8 pp. These gains are achieved with only 12–18 ms additional latency. Ablation studies confirm that Nested Evidence Survival drives recall improvements, while MRR-based selection converts these gains into precision, demonstrating that recall-first retrieval with principled pruning can outperform iterative control and model scaling in retrieval-augmented generation.

pdf bib abs

Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by incorporating external knowledge, yet traditional single-round retrieval struggles with complex multi-step reasoning.Agentic RAG addresses this by enabling LLMs to dynamically decide when and what to retrieve, but current RL-based training methods suffer from sparse outcome rewards that discard intermediate signals and low sample efficiency where failed samples contribute nothing.We propose Search-P1, a framework that introduces path-centric reward shaping for agentic RAG training, comprising two key components: (1) Path-Centric Reward, which evaluates the structural quality of reasoning trajectories through order-agnostic step coverage and soft scoring that extracts learning signals even from failed samples, and (2) Dual-Track Path Scoring with offline-generated reference planners that assesses paths from both self-consistency and reference-alignment perspectives.Experiments on multiple QA benchmarks demonstrate that Search-P1 achieves significant improvements over Search-R1 and other strong baselines, with an average accuracy gain of 7.7 points.

pdf bib abs

Modern industrial applications increasingly demand language models that act as agents, capable of multi-step reasoning and tool use in real-world settings. These tasks are typically performed under strict cost and latency constraints, making small agentic models highly desirable. In this paper, we introduce the AgenticQwen family of models, trained via multi-round reinforcement learning (RL) on synthetic data and a limited amount of open-source data. Our training framework combines reasoning RL and agentic RL with dual data flywheels that automatically generate increasingly challenging tasks. The reasoning flywheel increases task difficulty by learning from errors, while the agentic flywheel expands linear workflows into multi-branch behavior trees that better reflect the decision complexity of real-world applications. We validate AgenticQwen on public benchmarks and in an industrial agent system. The models achieve strong performance on multiple agentic benchmarks, and in our industrial agent system, close the gap with much larger models on search and data analysis tasks.

pdf bib abs

Short-video platforms now present tappable search entries beneath the video player, making it effortless for users to shift from passively watching to actively searching for information. Prior work on bottom-bar query generation conditions on titles and OCR to generate a single query per forward pass, constrains decoding with a trie, and evaluates against a single reference using edit-distance–style supervision—making it difficult to cover the diverse intents a video can trigger and to credit semantically equivalent query variants. Motivated by these limitations, we propose four complementary improvements. First, we reformulate the task as one-shot list generation, producing multiple distinct queries per video, and build multi-query ground truth from exposure and CTR logs. Second, we redesign offline evaluation with \operatorname{CTR\text{-}HungF1}, a CTR-weighted set-matching metric via optimal assignment over token-level F1 score. Third, we enrich context with a video-to-video-to-query (V2V2Q) RAG pipeline to provide behavior-grounded background knowledge. Finally, we apply thinking-free RLVR with deterministic format checks and \operatorname{CTR\text{-}HungF1} rewards to train a compact LLM without reward models or CoT distillation. The resulting system yields strong offline and online improvements, and has been deployed on Kuaishou to serve hundreds of millions of users daily.

pdf bib abs

Small Agents, Big Gains: Journey-Aware and Critic-Guided Simulation for Long-Horizon Shopping Dialogues
Qing Ping | Changyou Chen | Binxuan Huang

Modern e-commerce assistants must go beyond simple product search to support inspiration, comparison, and tool-grounded fact-checking across non-linear shopping journeys. However, distilling these complex behaviors into efficient, deployable models is bottle-necked by a lack of post-training data: trajectories must cover diverse agentic workflows with high fidelity, yet the desired outputs are open-ended without a single ground truth. We propose a closed-loop Multi-Agent Simulation Framework to synthesize diverse, faithful, and policy-aligned shopping trajectories. The system orchestrates a journey-aware, stateful user simulator to drive exploration, a shopping agent that manages both tools and UI elements, and a critic agent that provides rubric-driven feedback to iteratively refine the data. On a domain-specific benchmark, this synthetic data enables a small model to significantly outperform same-size baselines and surpass a large-model baseline, achieving near-zero tool-calling errors with 8× higher inference throughput.

pdf bib abs

Progressive Fine-Tuning for Cost-Effective Structured Attribute Generation in E-commerce
Lakshman Kolasani | Fatemeh Taheri Dezaki

Large language models (LLMs) excel at structured information generation but face cost and latency challenges when deployed at scale in user-facing products. We present a parameter efficient supervised fine-tuning pipeline for adapting a small language model (SLM) to structured attribute generation in e-commerce product listing, enabling continuous model improvement with implicit user feedback without expensive manual annotation. Our approach involves completeness-deficit guided curation, which ranks samples by divergence between model predictions and catalog listing attributes, selecting the highest completeness gap examples for progressive fine-tuning. Our system is deployed on a large-scale product listing service, reducing inference costs by 98% and p90 latency by 70% using a fine-tuned SLM relative to the baseline LLM while preserving an 86.4% user acceptance rate, translating to significant monthly infrastructure savings.

pdf bib abs

Grounded Multimodal In-Context Learning for Product Weight Estimation at Scale in E-commerce
Bhavuk Singhal | Arsh Keshari | Ravindra Kumar Yadav

Accurately inferring implicit physical attributes of products, such as weight, is critical for large-scale e-commerce logistics but challenging due to sparse or unreliable textual metadata and high visual variability. We formulate weight estimation as a grounded multimodal reasoning problem and investigate whether large vision-language models (LVLMs) can infer discretized weight buckets through in-context learning (ICL) over product images and descriptions. We introduce a scalable inference framework that conditions predictions on automatically retrieved, category-specific exemplars and propose a distribution-calibrated retrieval strategy that aligns few-shot contexts with the empirical weight distribution of each product sub-category. This calibration substantially improves few-shot multimodal reasoning compared to random or embedding-based retrieval baselines. Across 14 high-variance categories, our approach significantly outperforms strong multimodal KNN baselines in both exact-match accuracy and near-bucket reliability. Deployed in production on a large e-commerce platform, our system processes millions of listings daily and reduces shipping-related revenue leakage by ∼22%, demonstrating that multimodal ICL can serve as a practical and cost-effective alternative to manual or hardware-based verification.

pdf bib abs

Industrial advertising question answering (QA) is a high-stakes task in which hallucinated content, particularly fabricated URLs, can lead to financial loss, compliance violations, and legal risk. Although Retrieval-Augmented Generation (RAG) is widely adopted, deploying it in production remains challenging because industrial knowledge is inherently relational, frequently updated, and insufficiently aligned with generation objectives. We propose a reinforced co-adaptation framework that jointly optimizes retrieval and generation through two components: (1) Graph-aware Retrieval (GraphRAG), which models entity-relation structure over a high-citation knowledge subgraph for multi-hop, domain-specific evidence selection; and (2) evidence-constrained reinforcement learning via Group Relative Policy Optimization (GRPO) with multi-dimensional rewards covering faithfulness, style compliance, safety, and URL validity. Experiments on an internal advertising QA dataset show consistent gains across expert-judged dimensions including accuracy, completeness, and safety, while reducing the hallucination rate by 72%. A two-week online A/B test demonstrates a 28.6% increase in like rate, a 46.2% decrease in dislike rate, and a 92.7% reduction in URL hallucination. The system has been running in production for over half a year and has served millions of QA interactions.

pdf bib abs

Recent industrial credit scoring models remain heavily reliant on manually tuned statistical learning methods. Despite their potential, deep learning architectures have struggled to consistently outperform traditional statistical models in industrial credit scoring, largely due to the complexity of heterogeneous financial data and the challenge of modeling evolving creditworthiness. To bridge this gap, we introduce FinLangNet, a novel framework that reformulates credit scoring as a multi-scale sequential learning problem. FinLangNet processes heterogeneous financial data through a dual-module architecture that combines tabular feature extraction with temporal sequence modeling, generating probability distributions of users’ future financial behaviors across multiple time horizons. A key innovation is our dual-prompt mechanism within the sequential module, which introduces learnable prompts operating at both feature-level granularity for capturing fine-grained temporal patterns and user-level granularity for aggregating holistic risk profiles. Notably, real world deployment yielded a 6.3 pp improvement in KS, along with a 9.9% reduction in bad debt rate.

pdf bib abs

Accurate Point of Interest (POI) attribute acquisition is essential for location-based services, yet traditional modular Interactive Voice Response (IVR) systems suffer from error accumulation and high maintenance overhead. We present DuIVRS-2, a large language model (LLM)-based end-to-end framework designed for large-scale POI attribute acquisition at Baidu Maps. To address the long-tail distribution of real-world interactions, our methodology first employs a finite state machine (FSM)-guided data augmentation strategy to synthesize a balanced and diverse training dataset. We then streamline dialogue management via a selective generation scheme combined with a Chain-of-Thought (CoT) mechanism, which ensures output stability and effectively eliminates hallucinations in industrial settings. To facilitate continuous policy refinement with minimal manual effort, we design a cooperative iterative learning framework that leverages a dual-evaluator voting system. Deployed in production for two months, DuIVRS-2 processed 0.4 million calls daily and achieved a 83.9% Task Success Rate (TSR), outperforming its predecessor by 4 percentage points while maintaining a low reaction time of 130ms. This work provides a production-proven reference for developing robust, cost-effective LLM agents for large-scale industrial dialogue applications.

pdf bib abs

SAJA: A Simple Approach to Judge Alignment for LLM-as-a-Judge
Sneha Kola | Pankaj Kumar Sharma | Soumyadeep Dey | Bamdev Mishra | Mayur Datar

LLM-as-a-Judge systems are increasingly used to evaluate text at scale, yet production deployment demands low latency, minimal cost, and compatibility with closed-source APIs. Current approaches fall short in different ways: some require many LLM calls and per-dataset prompt tuning, others depend on logit access unavailable in commercial APIs, and yet others demand multiple rounds of LLM interaction for iterative feature discovery. We present **SAJA** (**S**imple **A**pproach to **J**udge **A**lignment), built on the principle that task-specific alignment should reside in a lightweight calibration head, not in elaborate prompts or model internals. SAJA makes exactly one LLM call per item using a fixed structured rubric prompt, extracts a multi-dimensional feature vector, and maps it to a human-aligned score via a calibration head trained on a small number of human labels. No iterative prompt search, no logit access, and no multi-round LLM interaction are needed. Yet SAJA matches far more complex systems across four evaluation paradigms: 86% F1 on MT-Bench pairwise preference (vs. 78% uncalibrated), competitive performance on five classification benchmarks with a single call, and +5.71% F1 over prompt-optimized baselines on proprietary data. Ablations confirm that multi-dimensional rubric features outperform one-dimensional calibration (SummEval 𝜌 improves from 0.60 to 0.74) and that coarse rubric outputs recover the same human alignment as full logit distributions (𝜌 = 0.36 vs. 0.37), establishing that logit access is unnecessary for calibrated judge alignment. Moreover, SAJA is model-agnostic: a 9B open-source model with SAJA (𝜌=0.70) surpasses raw GPT-4.1 (𝜌=0.60). Its single-call design yields up to 4.8× cost savings over per-question approaches.

pdf bib abs

With the advancement of vision-language models, web automation has made significant progress. However, deploying autonomous agents in real-world settings remains challenging, primarily due to site heterogeneity, where generalist models lack domain-specific priors for diverse interfaces, and long-horizon instability, characterized by the accumulation of decision drift over extended interactions. To address these challenges, we introduce ColorBrowserAgent (Complex Long-Horizon Browser Agent), a knowledge-evolving agent for robust web automation. Our approach addresses these challenges through two synergistic mechanisms: human-in-the-loop knowledge adaptation that transforms sparse human feedback into reusable domain knowledge, and knowledge-aligned progressive summarization that stabilizes long interactions through memory compression. Extensive experiments on WebArena, WebChoreArena and industrial deployment show that ColorBrowserAgent consistently outperforms strong baselines. It achieves a state-of-the-art success rate of 71.2% on WebArena and maintains 47.4% performance under zero-shot transfer setting on WebChoreArena. In commercial deployment, it improves user satisfaction by 19.3% relatively, verifying its robustness in real-world scenarios.

pdf bib abs

Clinical decisions are often required under incomplete information. Clinical experts must identify whether available information is sufficient for judgment, as both premature conclusions and unnecessary abstention can compromise patient safety. To evaluate this capability of large language models (LLMs), we developed ClinDet-Bench, a benchmark based on clinical scoring systems that decomposes incomplete-information scenarios into determinable and undeterminable conditions. Identifying determinability requires considering all hypotheses about missing information, including unlikely ones, and verifying whether the conclusion holds across them. We find that recent LLMs fail to identify determinability under incomplete information, producing both premature conclusions and excessive abstention, despite correctly explaining the underlying scoring knowledge and performing well under complete information. These findings suggest that existing benchmarks are insufficient to evaluate the safety of LLMs in clinical settings. ClinDet-Bench provides a framework for evaluating determinability recognition, leading to appropriate abstention, with potential applicability to medicine and other high-stakes domains, and is publicly available.

pdf bib abs

Artificial intelligence doctor assistants (AIDAs) help streamline clinical decision-making and reduce physician workload. While existing systems primarily utilize Large Language Models (LLMs) or retrieval-augmented generation (RAG), these methods typically retrieve static facts—whether as text passages or structured graphs—lacking the explicit logical pathways essential for multi-step reasoning. In this paper, we propose the AIDA-SEAT framework to provide reliable clinical decision-making support. First, we design the state-evaluation-action tree (SEAT), which covers diagnosis, treatment, and examination. To develop this tree, we refine and transform SEAT collected from medical documents and doctors. Then, we propose an adaptive method to select optimal trees tailored to the current patients’ state. Finally, we leverage LLMs to perform state assessment, evaluation, and action execution based on the tree, thereby generating reliable responses. To evaluate the effectiveness of our method, we conducted extensive experiments on a self-built dataset. Our method achieves 1.01% higher than current state-of-the-art (SOTA) baselines across five departments, including common RAG-based methods. Furthermore, analysis of 200 consultation records during deployment on an online hospital revealed that system-assisted responses are 24.16 seconds faster on average than manual ones, improving efficiency by 26.85%.

pdf bib abs

IndustryAssetEQA: A Neurosymbolic Operational Intelligence System for Embodied Question Answering in Industrial Asset Maintenance
Chathurangi Shyalika | Dhaval C Patel | Amit Sheth

Industrial maintenance environments increasingly rely on AI systems to assist operators in understanding asset behavior, diagnosing failures, and evaluating interventions. Although large language models (LLMs) enable fluent natural-language interaction, deployed maintenance assistants routinely produce generic explanations that are weakly grounded in telemetry, omit verifiable provenance, and offer no testable support for counterfactual or action-oriented reasoning that undermine trust in safety-critical settings. We present IndustryAssetEQA, a neurosymbolic operational intelligence system that combines episode-centric telemetry representations with a Failure Mode and Effects Analysis Knowledge Graph (FMEA-KG) to enable Embodied Question Answering (EQA) over industrial assets. We evaluate on four datasets covering four industrial asset types, including rotating machinery, turbofan engines, hydraulic systems, and cyber–physical production systems. Compared to LLM-only baselines, IndustryAssetEQA improves structural validity by up to +0.51, counterfactual accuracy by up to +0.47, and explanation entailment by +0.64, while reducing severe expert-rated overclaims from 28% to 2% ( 93%). Code, datasets, and the FMEA-KG are available at: https://github.com/IBM/AssetOpsBench/tree/IndustryAssetEQA/IndustryAssetEQA

pdf bib abs

Industrial maintenance platforms contain rich but fragmented evidence, including free-text work orders, heterogeneous operational sensors or indicators, and structured failure knowledge. These sources are often analyzed in isolation, producing alerts or forecasts that do not support conditional decision-making: given this asset history and behavior, what is happening and what action is warranted?We present Condition Insight Agent, a deployed decision-support framework that integrates maintenance language, behavioral abstractions of operational data, and engineering failure semantics to produce evidence-grounded explanations and advisory actions. The system constrains reasoning through deterministic evidence construction and structured failure knowledge, and applies a rule-based verification loop to suppress unsupported conclusions.Case studies from production CMMS deployments show that this verification-first design operates reliably under heterogeneous and incomplete data while preserving human oversight. Our results demonstrate how constrained LLM-based reasoning can function as a governed decision-support layer for industrial maintenance.

Real-time AI experiences call for on-device large language models (OD-LLMs) optimized for efficient deployment on resource-constrained hardware. The most useful OD-LLMs produce near-real-time responses and exhibit broad hardware compatibility, maximizing user reach. We present a methodology for designing such models using hardware-in-the-loop architecture search under mobile latency constraints. This system is amenable to industry-scale deployment: it generates models deployable without custom kernels and compatible with standard mobile runtimes like Executorch. Our methodology avoids specialized attention mechanisms and instead uses attention skipping for long-context acceleration. Our approach jointly optimizes model architecture (layers, dimensions) and attention pattern. To efficiently evaluate candidates, we treat each as a pruned version of a pretrained backbone with inherited weights, thereby achieving high accuracy with minimal continued pretraining. We leverage the low cost of latency evaluation in a staged process: learning an accurate latency model first, then searching for the Pareto-frontier across latency and quality.This yields MobileLLM-Flash, a family of foundation models (350M, 650M, 1.4B) for efficient on-device use with strong capabilities, supporting up to 8k context length. MobileLLM-Flash delivers up to 1.8x and 1.6x faster prefill and decode on mobile CPUs with comparable or superior quality. Our analysis of Pareto-frontier design choices offers actionable principles for OD-LLM design.

pdf bib abs

Layer-aligned distillation and convergence-based early exit represent two predominant computational efficiency paradigms for transformer inference; yet we establish that they exhibit fundamental incompatibility under standard deployment conditions for convergence-based early exit. Distillation objectives that align intermediate student layers to teacher representations suppress the representational convergence that early-exit mechanisms exploit, rendering such mechanisms ineffective on distilled models.We introduce LEAP (Layer-wise Exit-Aware Pretraining), an auxiliary training objective that reconciles this incompatibility. LEAP requires no architectural modifications; it augments standard distillation with a single constraint ensuring intermediate layers approximate final-layer representations. LEAP-MiniLM achieves 1.61× measured wall-clock speedup (batch = 1, NVIDIA L4) at 𝜃 = 0.95, with 91.9% of samples exiting by layer 7 and 1.80× theoretical layer reduction, where standard distilled models achieve zero effective speedup. We validate across sentence similarity (STS-B: 0.760 ± 0.006) and retrieval benchmarks (BEIR), providing operational guidance including latency measurements, decision thresholds, and deployment criteria.

pdf bib abs

Teaching language models to reason about code execution is still an open problem. Current synthetic Chain-of-Thought (CoT) training data often consists of plausible-sounding explanations generated by teacher models, not verifiable accounts of actual program behavior. This causes models to learn logically flawed reasoning patterns despite syntactic correctness.We address this by grounding CoT generation directly in program execution traces. Our pipeline instruments code to capture dynamic behavior, narrates execution traces into natural language, and actively verifies each rationale against the trace. We systematically create 54,000 execution-verified, bi-directional rationales that teach models to reason both forward (input→output) and backward (output→input). Models fine-tuned on our verified data achieve substantial improvements, with a performance boost of +24.2 on LiveCodeBench-Exec, +22.3 on CruxEval-Output, and +21.1 on CruxEval-Input, demonstrating that verification quality directly determines both reasoning and code generation capabilities.

pdf bib abs

Generative information retrieval (GenIR) formulates the retrieval process as a text-to-text generation task, leveraging the vast knowledge of large language models. However, existing works primarily optimize for relevance while often overlooking document trustworthiness. This is critical in high-stakes domains like healthcare and finance, where relying solely on semantic relevance risks retrieving unreliable information. To address this, we propose an Authority-aware Generative Retriever (AuthGR), the first framework that incorporates authority into GenIR. AuthGR consists of three key components: (i) Multimodal Authority Scoring, which employs a vision-language model to quantify authority from textual and visual cues; (ii) a Three-stage Training Pipeline to progressively instill authority awareness into the retriever; and (iii) a Hybrid Ensemble Pipeline for robust deployment. Offline evaluations demonstrate that AuthGR successfully enhances both authority and accuracy, with our 3B model matching a 14B baseline. Crucially, large-scale online A/B tests and human evaluations conducted on the commercial web search platform confirm significant improvements in real-world user engagement and reliability.

pdf bib abs

Person-Job Fit (PJF) is a critical component for online recruitment. Existing approaches face several challenges, particularly in handling low-quality job descriptions and similar candidate-job pairs, which impair model performance. To address these challenges, this paper proposes a large language model (LLM) based method with two novel techniques: (1) LLM-based data augmentation, which polishes and rewrites low-quality job descriptions by leveraging chain-of-thought (COT) prompts, and (2) category-aware Mixture of Experts (MoE) that assists in identifying similar candidate-job pairs. This MoE module incorporates category embeddings to dynamically assign weights to the experts and learns more distinguishable patterns for similar candidate-job pairs. We perform offline evaluations and online A/B tests on our recruitment platform. Our method relatively surpasses existing methods by 2.40% in AUC and 7.46% in GAUC, and boosts click-through conversion rate (CTCVR) by 19.4% in online tests, saving millions of CNY in external headhunting expenses.

pdf bib abs

EfficientTool: A Cost-Effective Aligning Framework for Tool-Conditioned Agents in SME Scenarios
Yuanqi Mu | Bingfeng.Pi | Defei Xia | Lei.Zuo | Yongqi Zhang

Large language models (LLMs) are increasingly adopted in downstream industries, yet aligning proprietary agents remains challenging due to limited high-quality data and hardware constraints in small and medium-sized enterprises (SMEs).We propose EfficientTool, a cost-effective, tool-conditioned alignment framework forming a closed loop over data collection, iterative training, and deployment-oriented evaluation.EfficientTool adopts a self-evolving bootstrapping-based Trajectory Collection Pipeline for high-quality trajectory generation, followed by iterative Model Training Pipeline using tool-conditioned parameter-efficient fine-tuning (PEFT).We evaluate the model with Interaction and Evaluation Pipeline in public and private benchmarks, and deploy for an internal enterprise agent.Results show that EfficientTool effectively aligns model in SME scenarios while preserving general tool-calling capability.

pdf bib abs

As LLMs scale, low-bit floating-point formats like MXFP and NVFP4 offer new opportunities for precision and efficiency. In this work, we evaluate HiFloat (HiF8 and HiF4), a family of formats tailored for Ascend NPUs. Through rigorous comparison across weight-activation and KV-cache tasks, we provide three key insights: (1) INT8 suits narrow-range data, while floating-point formats excel with high-variance data; (2) in 4-bit regimes, HiF4’s hierarchical scaling prevents the accuracy collapse seen in integer formats; and (3) HiFloat is fully compatible with state-of-the-art post-training quantization frameworks. Overall, HiFloat provides a solution for high-efficiency LLM inference on NPUs.

pdf bib abs

Securing the Tool Layer: A Threat Taxonomy and Runtime Defense Framework for Model Context Protocol Deployments
Saurabh Yergattikar

The Model Context Protocol (MCP) has rapidly emerged as the dominant standard for connecting large language models to external tools, databases, and services. Yet this convenience introduces a fundamentally new attack surface that existing LLM safety measures fail to address: adversaries can now compromise AI agents not through the user prompt, but through the tools the agent trusts. We present ShieldMCP, a runtime security framework grounded in two contributions. First, we introduce a structured threat taxonomy derived from analysis of 80+ attack techniques catalogued under the SAFE-MCP initiative within the Linux Foundation’s OpenSSF, spanning 14 tactical categories adapted from the MITRE ATT CK methodology. Second, we describe an interception architecture that performs real-time validation of MCP tool calls and responses, combining structural analysis with semantic intent verification to detect tool poisoning, indirect prompt injection through tool outputs, and supply chain manipulation. In red-team evaluation across five popular LLM backends, ShieldMCP reduces attack success rates from 74% to under 9% for tool poisoning and from 47% to under 6% for indirect prompt injection via tool responses, while adding fewer than 120ms of median latency per tool call. We discuss deployment considerations, the tension between security and agent utility, and lessons applicable to any organization integrating MCP into production workflows. Our framework is categorized as an Emerging system intended for real-world deployment.

pdf bib abs

Enhancing Job Evaluation with Data Augmentation and Text Classification
Samaneh Jalilian | Niels van Weeren | Mohammad Shokri | Thijmen Bijl | Suzan Verberne

Accurate job grading and evaluation are essential for ensuring fair compensation in Human Resources (HR) planning. In this research, we propose to improve job evaluation by semi-automating a manual, time-consuming, and inconsistent process with text-based classification models. We address three prediction tasks: job title classification, grading, and compensation prediction. For job title classification, we fine-tune a RoBERTa model for classification and use Gemini to generate synthetic job descriptions for rare job titles. For grade and compensation prediction, we compare TF-IDF and transformer-based embeddings (DistilRoBERTa, MPNet, MiniLM) in combination with deep neural networks and tree-based models (Random Forest, XGBoost). We optimize all models using grid search with hyperparameter tuning and cross-validation. The results show that job title classification by RoBERTa with Gemini-generated descriptions works well with an accuracy of about 97%. In our regression experiments, our models get promising results: for grade prediction, a tuned TF-IDF + XGBoost model achieves a mean absolute error (MAE) of 0.185, and for annual salary prediction, MiniLM embeddings with XGBoost get an MAE of €1,587. These findings demonstrate that a semi-automated pipeline can enhance traditional manual processes by boosting consistency, speeding up HR workflows, and reducing biased assessments.

pdf bib abs

Industrial Retrieval-Augmented Generation (RAG) systems depend on optical character recognition (OCR) to transform visual documents into text. Existing OCR benchmarks rely on character-level metrics, which inadequately measure downstream RAG effectiveness under real-world conditions. We introduce an OCR benchmark for industrial RAG systems covering 11 challenging document types, including extreme layouts, high-resolution pages, complex or watermarked backgrounds, historical documents with non-standard reading orders, visually decorated text, and documents containing tables and mathematical formulas. Evaluating recent SOTA OCR models under a controlled OCR-first RAG pipeline shows clear performance degradation on realistic industrial documents despite strong conventional benchmark scores. We find that high OCR accuracy does not necessarily translate into strong downstream RAG performance: structural and semantic errors can cause substantial retrieval failures even when WER/CER remains low. Further analysis shows that this mismatch is category-dependent, arises through both retrieval-side and downstream generation-side failures, and remains stable across representative OCR-first pipeline choices. The benchmark is publicly available at https://github.com/Qihoo360/InduOCRBench.

pdf bib abs

Evaluating LLM-generated business ideas is often harder to scale than generating them.Unlike standard NLP benchmarks, business idea evaluation relies on multi-dimensional criteria such as feasibility, novelty, differentiation, user need, and market size, and expert judgments often disagree.This paper studies a methodological question raised by such disagreement: should an automatic judge approximate an aggregate consensus, or model evaluators individually?We introduce PBIG-DATA, a dataset of approximately 3,000 individual scores across 300 patent-grounded product ideas, provided by domain experts on six business-oriented dimensions:specificity, technical validity, innovativeness, competitive advantage, need validity, and market size.Analyses show substantial expert disagreement on fine-grained ordinal scores, while agreement is higher under coarse selection, suggesting structured heterogeneity rather than random noise.We then compare three judge configurations: a rubric-only zero-shot judge, an aggregate judge conditioned on mixed evaluator histories, and a personalized judge conditioned on the target evaluator’s scoring history.Across dimensions and model sizes, personalized judges align more closely with the corresponding evaluator than aggregate judges, and evaluator agreement correlates with similarity of judge-generated reasoning only under personalized conditioning.These results indicate that pooled labels can be a fragile target in pluralistic evaluation settings and motivate evaluator-conditioned judge designs for business idea assessment.

pdf bib abs

Memory-Efficient Structured Backpropagation for On-Device LLM Fine-Tuning
JuneYoung Park | Yuri Hong | Seongwan Kim | Jaeho Lee

On-device fine-tuning enables privacy-preserving personalization of large language models, but mobile devices impose severe memory constraints, typically 6–12GB shared across all workloads. Existing approaches force a trade-off between exact gradients with high memory (MeBP) and low memory with noisy estimates (MeZO). We propose Memory-efficient Structured Backpropagation (MeSP), which bridges this gap by manually deriving backward passes that exploit LoRA’s low-rank structure. Our key insight is that the intermediate projection h = xA can be recomputed during backward at minimal cost since rank r ≪ d_in, eliminating the need to store it. MeSP achieves 49% average memory reduction compared to MeBP on Qwen2.5 models (0.5B–3B) while computing mathematically identical gradients. Our analysis also reveals that MeZO’s gradient estimates show near-zero correlation with true gradients (cosine similarity ≈0.001), explaining its slow convergence. MeSP reduces peak memory from 361MB to 136MB for Qwen2.5-0.5B, enabling fine-tuning scenarios previously infeasible on memory-constrained devices.

pdf bib abs

Multi-Agent Orchestration for Terminology-Constrained Machine Translation in Industrial Localization
Emanuele Di Rosa

Accurate terminology is a non-negotiable requirement in industrial localization processes: a single mistranslated domain term can violate contractual obligations and erode client trust.We present AIDA_term, a deployed multi-agent LLM pipeline that orchestrates four specialized agents—Analysis, Translation, Post-editing, and Review—for terminology-constrained machine translation.The system introduces terminology-aware pre-analysis, explicit glossary injection at every pipeline stage, and a reasoning-enabled Review agent.We evaluate six configurations on the WMT25 Terminology Translation benchmark (Track 1: en→de/es/ru, IT domain), enabling systematic ablation of each design choice.Our best configuration achieves 99.4% average terminology accuracy while attaining the highest ChrF2++ scores across all three language pairs, outperforming all 20 systems submitted to the shared task.Unlike other multi-agent approaches in WMT25 that rely on generate-and-select strategies, AIDA_term is the first to apply a role-specialized sequential pipeline to terminology-constrained MT, and is deployed with native XLIFF integration for seamless CAT tool interoperability.The system processes thousands of terminology-constrained requests daily at a large localization provider.

pdf bib abs

Anticipating and capturing transient demand spikes is a critical challenge for e-commerce platforms, as reactive discovery mechanisms often fail to surface relevant products during rapid cultural or seasonal shifts. We propose TrendPulse, a three-stage framework that identifies regional search momentum, leverages Large Language Model (LLM) to transform spikes into semantic trends, and employs a cross-attention mechanism to provide personalized catalog recommendations. Our comprehensive ablation experiments and evaluations validate the impact of each architectural component, showing consistent improvements across multiple critical business metrics. TrendPulse’s effectiveness is further validated through online A/B experiments, where it drives measurable gains in both business metrics and overall user experience. Finally, we outlined the deployment strategy in detail, providing a reproducible blueprint that can be readily applied to similar industry-scale applications.

pdf bib abs

Credit risk models suffer from rapid performance decay due to distribution shifts, requiring frequent updates to meet strict operational guardrails. However, manual refreshing takes weeks of trial-and-error across upstream data engineering and downstream training. We present ACRM, a deployed multi-agent framework that automates the end-to-end credit modeling workflow by treating it as a learnable trajectory of agent interactions. Unlike AutoML, which optimizes hyperparameters on fixed datasets, ACRM’s action space extends to upstream data semantics—cohort selection, observation windowing, feature screening—where the majority of performance recovery occurs. A central Orchestrator coordinates specialist agents through a three-stream decision stack: rule-based safety guardrails, retrieval-augmented grounding from historical workflows, and preference alignment via DPO on expert-labeled trajectories. Deployed at a major fintech institution for three months across six business scenarios, ACRM reduced the average model refresh cycle from weeks to 1.1 days and iteration rounds by 65%, while maintaining superior stability metrics.

pdf bib abs

As large language models (LLMs) grow in size, efficient compression techniques like quantization and sparsification are critical. While quantization maintains performance with reduced precision, structured sparsity methods, such as N:M sparsification, often fall short due to limited flexibility and sensitivity to outlier weights. We explore 8:16 semi-structured sparsity, demonstrating its ability to surpass the Performance Threshold—where a compressed model matches the accuracy of its uncompressed or smaller counterpart under equivalent memory constraints. Compared to 2:4 sparsity, 8:16 offers greater flexibility with minimal storage overhead (0.875 vs. 0.75 bits/element). We also apply sparse structured patterns for salient weights, showing that structured sparsity for outliers is competitive with unstructured approaches, leading to equivalent or better results. Finally, we demonstrate that simple techniques such as variance correction and SmoothQuant-like weight equalization improve sparse models performance.

pdf bib abs

The whole-page reranking integrates retrieval results from multiple modalities and is critical for user experience of search engines, yet it requires costly large-scale expert annotations due to the complexity of assessing cross-modal relevances. In this paper, we propose SMAR, a novel whole-page reranking framework that converts single-modal rankers into page-level guidance by constructing budget-aware candidates for cross modal annotations and distilling intra-modality preferences to align relevance scales across modalities. Specifically, we use pre-trained single-modal rankers to construct candidate pages for limited cross-modal annotation at the page level. The whole-page reranker is then trained on these samples, enforcing consistency with single-modal preferences to preserve intra-modal ranking quality. Experiments on the Qilin and CrossRank datasets demonstrate that SMAR reduces annotation costs by 70-90% while outperforming the fully-annotated reranking baselines. Further offline and online A/B tests confirm significant gains in both ranking metrics and user experience, validating the effectiveness and practical value of our approach in real-world search scenarios.

pdf bib abs

Prediction markets provide a unique setting where event-level time series are directly tied to natural-language descriptions, yet discovering robust lead–lag relationships remains challenging due to spurious statistical correlations. We propose a hybrid two-stage causal screener to address this challenge: (i) a statistical stage that uses Granger causality to identify candidate leader–follower pairs from market-implied probability time series, and (ii) an LLM-based semantic stage that re-ranks these candidates by assessing whether the proposed direction admits a plausible economic transmission mechanism based on event descriptions. Because causal ground truth is unobserved, we evaluate the ranked pairs using a fixed, signal-triggered trading protocol that maps relationship quality into realized profit and loss (PnL).On Kalshi Economics markets, our hybrid approach consistently outperforms the statistical baseline. Across rolling evaluations, the win rate increases from 51.4% to 54.5%. Crucially, the average magnitude of losing trades decreases substantially from 649 USD to 347 USD. This reduction is driven by the LLM’s ability to filter out statistically fragile links that are prone to large losses, rather than relying on rare gains. These improvements remain stable across different trading configurations, indicating that the gains are not driven by specific parameter choices. Overall, the results suggest that LLMs function as semantic risk managers on top of statistical discovery, prioritizing lead–lag relationships that generalize under changing market conditions.

pdf bib abs

We introduce pplx-embed, a family of multilingual embedding models that employ multi-stage contrastive learning on a diffusion-pretrained language model backbone for web-scale retrieval.By leveraging bidirectional attention through diffusion-based pretraining, our models capture comprehensive bidirectional context within passages, enabling the use of mean pooling to better preserve global context across long documents.We release pplx-embed-v1 for standard retrieval, and pplx-embed-context-v1 for contextualized embeddings that incorporate global document context into passage representations.pplx-embed-v1 achieves competitive performance on the MTEB(Multilingual, v2), MTEB(Code), BERGEN, and ToolRet retrieval benchmarks, while pplx-embed-context-v1 sets new records on the ConTEB benchmark.

pdf bib abs

Actionable Interpretability for Churn Classification: A Text Bottleneck Model Case Study at a Major Telecom Provider
Adrian Sauter | Vera Neplenbroek | Georgios Vlassopoulos | Gianluigi Bardelloni

In subscription-based businesses, understanding why a customer intends to churn is as vital as the classification itself. We present a casestudy at a large European telecommunications provider, where we implement Text Bottleneck Models (TBMs) for post-call churn classifica-tion. The TBM distills dialogues into a sparse set of human-interpretable concepts and provides faithful, snippet-based evidence for everydecision. We show that the TBM performs competitively with black-box baselines and demonstrate potential business impact via automatedcall profiling and an interactive stakeholder dashboard. Our work demonstrates that the perceived trade-off between interpretability andpredictive performance can be bridged, providing the high-accuracy evidence needed for industrial retention strategies.

pdf bib abs

Decoding Text Spans for Efficient and Accurate Named-Entity Recognition
Andrea Maracani | Savas Ozkan | Junyi Zhu | Sinan Mutlu | Mete Ozay

Named Entity Recognition (NER) is a key component in industrial information extraction pipelines, where systems must satisfy strict latency and throughput constraints in addition to strong accuracy. State-of-the-art NER accuracy is often achieved by span-based frameworks, which construct span representations from token encodings and classify candidate spans. However, many span-based methods enumerate large numbers of candidates and process each candidate with marker-augmented inputs, substantially increasing inference cost and limiting scalability in large-scale deployments. In this work, we propose SpanDec, an efficient span-based NER framework that targets this bottleneck. Our main insight is that span representation interactions can be computed effectively at the final transformer stage, avoiding redundant computation in earlier layers via a lightweight decoder dedicated to span representations. We further introduce a span filtering mechanism during enumeration to prune unlikely candidates before expensive processing. Across multiple benchmarks, SpanDec matches competitive span-based baselines while improving throughput and reducing computational cost, yielding a better accuracy–efficiency trade-off suitable for high-volume serving and on-device applications.

pdf bib abs

We introduce Earth Virtual Expert (EVE), the first open-source, end-to-end initiative for developing and deploying domain-specialized LLMs for Earth Intelligence. At its core is EVE-Instruct, a domain-adapted 24B model built on Mistral Small 3.2 and optimized for reasoning and question answering. On newly constructed Earth Observation and Earth Sciences benchmarks, it outperforms comparable models while preserving general capabilities.We release curated training corpora and the first systematic domain-specific evaluation benchmarks, covering MCQA, open-ended QA, and factuality. EVE further integrates RAG and a hallucination-detection pipeline into a production system deployed via API and GUI, supporting 350 pilot users. All models, datasets, and code are publicly available.

pdf bib abs

Hemolix.TabGen: Optimized Table Generation from Documents
Gyanendra Shrestha | Todor Ivanov | Karthik Vemireddy | Anna Pyayt | Michael Gubanov

Modern Data Lakes contain vast and heterogeneous document collections, making table generation from documents a persistent and nontrivial challenge. Traditional approaches are often rigid — i.e. domain-specific, require extensive supervision, or are limited to set of pre-defined schemas; LLM-based approaches are more flexible, but typically suffer from hallucinations, non-determinism, and high computational costs. To overcome these limitations, we introduce Hemolix.TabGen, a novel scalable LLM-based table generation systemthat comprehends documents and generates Bi-dimensional tables based on the entire document content. We evaluated TabGen on 4 publicly available datasets spanning multiple domains and observed an Average Precision delta up to 30% compared to vanilla LLMs

pdf bib abs

Optimizing large language models for industrial sales requires balancing long-term commercial objectives (e.g., conversion rate) with immediate linguistic constraints such as fluency and compliance. Conventional reinforcement learning often merges these heterogeneous goals into a single reward, causing high-magnitude session-level rewards to overwhelm subtler turn-level signals, which leads to unstable training or reward hacking.To address this issue, we propose **Dual-Horizon Credit Assignment (DuCA)**, a framework that disentangles optimization across time scales. Its core, **Horizon-Independent Advantage Normalization (HIAN)**, separately normalizes advantages from turn-level and session-level rewards before fusion, ensuring balanced gradient contributions from both immediate and long-term objectives to the policy update.Extensive experiments with a high-fidelity user simulator show DuCA outperforms the state-of-the-art GRPO baseline, achieving a 6.82% relative improvement in conversion rate, reducing inter-sentence repetition by 82.28%, and lowering identity detection rate by 27.35%, indicating a substantial improvement for an industrial sales scenario that effectively balances the dual demands of strategic performance and naturalistic language generation.

pdf bib abs

Know What You See: Grounded localization of product components
Manan Soni | Abinesh Kanagarajan | Shyam Mohan

Many real-world decisions about products (e.g. how they function, how they should be used) depend on their components rather than the object as a whole. Accurately identifying product component has applications like automated defect detection, visual spare-parts search, and verified assembly. However, existing object detectors treat components as isolated objects, ignoring their inherent structure. We propose Know What You See (KWYS), where we localize components by grounding them using a textual knowledge base (e.g., manuals or web descriptions). KWYS converts raw text into a hierarchical component taxonomy, which then guides an open-vocabulary object detector using a hierarchical verification algorithm. We evaluate on 1,000 product images across 5 diverse categories, improving component localization accuracy by 11% along with reducing component hallucinations by 25%.

pdf bib abs

Knowledge Graph–based Retrieval-Augmented Generation (KG-RAG) enables natural language interaction with structured enterprise knowledge, yet existing agentic approaches that perform well on public benchmarks often fail to generalize to real-world enterprise Knowledge Graphs (KGs), which are dense, schema-driven, and operationally constrained. To address these limitations, we propose SCAIR (Schema-Conditioned Agentic Iterative Reasoning), a training-free framework that integrates structured planning with controlled iterative reasoning by injecting schema-conditioned structural priors and enforcing schema-aware traversal during multi-hop reasoning. Experiments on an enterprise-oriented benchmark constructed from a real-world Configuration Management DataBase (CMDB) demonstrate that SCAIR substantially improves performance over existing KG-RAG methods. Crucially, our study highlights that reliable enterprise graph reasoning cannot rely on generic agentic designs; instead, it must explicitly incorporate the target domain’s structural and operational constraints into the reasoning process. We demonstrate that by aligning agent design with business logic, substantial performance gains can be achieved without the need for costly model retraining.

pdf bib abs

FROST: Factual Reasoning via Optimized Stochastic Trajectories in Large Language Models during Inference
Soumedhik Bharati | Ebad Shabbir | Jiechao Gao

Large language models face a trade-off between factual consistency and reasoningdiversity: deterministic decoding prioritizes reliability but may miss alternativesolution paths, while high-temperature sampling increases exploration at the costof accuracy. We present FROST (Factual Reasoning via Optimized StochasticTrajectories), an inference-time framework that balances exploration andexploitation without additional training or context augmentation. FROST combinesdeterministic inference from a large model with targeted stochastic sampling froma smaller model, selecting outputs via multi-criteria validation over coherence,factual grounding, and semantic novelty. Across HotpotQA, CommonsenseQA, andMMLU, FROST achieves 2–5 percentage point improvements over standard chain-of-thoughtprompting and reduces unsupported outputs by 40% relative to Standard CoT. Comparedto Self-Consistency ensembles, FROST delivers comparable accuracy at 28% lowerinference cost through strategic delegation to smaller models. On an adversarialsubset with unanswerable queries, FROST abstains on 34% of cases versus 8% forstandard chain-of-thought, reducing false positives by 45%. Task-stratifiedevaluation shows that exploration benefits scale with problem ambiguity.Generalization to mathematical reasoning, code generation, and multimodal domainsremains future work.

pdf bib abs

Extracting structured data from unstructured text using large language models (LLMs) becomes challenging when the target schemas are large and complex. In such cases, including the full schema in the prompt increases cost and latency, risks lost-in-the-middle performance degradation, and can exceed context length limits. We propose SchemaRAG, a retrieval-augmented generation (RAG) framework that dynamically prunes the output schema space for schema-conditioned information extraction tasks by leveraging schema metadata and few-shot examples (when available). We evaluate SchemaRAG on real-world healthcare and e-commerce datasets. Our results show that SchemaRAG can achieve up to an 8.8% increase in micro-F1, a 47% reduction in latency, and a 48% reduction in token costs, demonstrating its practicality for large-schema extraction.

pdf bib abs

Enterprise IT support interactions are fundamentally diagnostic: effective resolution requires iterative evidence gathering from ambiguous user reports to identify an underlying root cause. While retrieval-augmented generation (RAG) provides grounding through historical cases, standard multi-turn RAG systems lack explicit diagnostic state and therefore struggle to accumulate evidence and resolve competing hypotheses across turns.We introduce DQA, a diagnostic question-answering framework that maintains persistent diagnostic state and aggregates retrieved cases at the level of root causes rather than individual documents. DQA combines conversational query rewriting, retrieval aggregation, and state-conditioned response generation to support systematic troubleshooting under enterprise latency and context constraints.We evaluate DQA on 150 anonymized enterprise IT support scenarios using a replay-based protocol. Averaged over three independent runs, DQA achieves a 78.7% success rate under a trajectory-level success criterion, compared to 41.3% for a multi-turn RAG baseline, while reducing average turns from 8.4 to 3.9. This improvement reflects the benefit of explicitly representing competing explanations and aggregating evidence across turns in unscripted troubleshooting.

pdf bib abs

We introduce Agentic Economic Modeling (AEM), a framework that aligns synthetic LLM choices with small-sample human evidence for reliable econometric inference. AEM first generates task-conditioned synthetic choices via LLMs, then learns a bias-correction mapping from task features and raw LLM choices to human-aligned choices, upon which standard econometric estimators perform inference to recover demand elasticities and treatment effects. We validate AEM in two experiments. In a large scale conjoint study with millions of observations, using only 10% of the original data to fit the correction model lowers the error of the demand-parameter estimates, while uncorrected LLM choices even increase the errors. In a regional field experiment, a mixture model calibrated on 10% of geographic regions estimates an out-of-domain treatment effect of -65±10 bps, closely matching the full human experiment (-60±8 bps). Under time-wise extrapolation, training with only day-one human data yields -24 bps (95% CI: [-26, -22], p<1e-5), improving over the human-only day-one baseline (-17 bps, 95% CI: [-43, +9], p=0.2049). These results demonstrate AEM’s potential to improve RCT efficiency and establish a foundation method for LLM-based counterfactual generation.

pdf bib abs

Robust Explanations for User Trust in Enterprise NLP Systems
Guilin Zhang | Kai Zhao | Jeffrey Friedman | Xu Chu | Amine Anoun

Robust explanations are increasingly required for user trust in enterprise NLP, yet pre-deployment validation is difficult in the common case of black-box deployment (API-only access) where representation-based explainers are infeasible and existing studies provide limited guidance on whether explanations remain stable under real user noise, especially when organizations migrate from encoder classifiers to decoder LLMs. To close this gap, we propose a unified black-box robustness evaluation framework for token-level explanations based on leave-one-out occlusion, and operationalize explanation robustness with top-token flip rate under realistic perturbations (swap, deletion, shuffling, and back-translation) at multiple severity levels. Using this protocol, we conduct a systematic cross-architecture comparison across three benchmark datasets and six models spanning encoder and decoder families (BERT, RoBERTa, Qwen 7B/14B, Llama 8B/70B; 64,800 cases). We find that decoder LLMs produce substantially more stable explanations than encoder baselines (73% lower flip rates on average), and that stability improves with model scale (44% gain from 7B to 70B). Finally, we relate robustness improvements to inference cost, yielding a practical cost–robustness tradeoff curve that supports model and explanation selection prior to deployment in compliance-sensitive applications.

pdf bib abs

Reinforcement learning (RL) has become a cornerstone of the post-training pipeline for large language models (LLMs), enabling capabilities such as complex reasoning and tool use. However, standard RL approaches face significant challenges due to reward sparsity. Moreover, LLMs typically exhibit mode-seeking behavior, concentrating probability mass on high-likelihood regions. This lack of diversity biases the model toward premature exploitation, hindering the exploration necessary for optimal learning. To address this, we propose VEG (verbal 𝜖-greedy), a novel framework that leverages external feedback as a dynamic control variable to explicitly balance exploration and exploitation within the semantic space. This method not only supplements sparse final rewards with intermediate signals but also enforces sustained exploration throughout the training process. Experiments on Tau Bench and SearchQA demonstrate that our method achieves superior accuracy compared to standard RL baselines. Notably, the trained policy eventually outperforms the external feedback model itself, demonstrating that VEG enables the agent to effectively filter and improve upon the guidance it receives.

pdf bib abs

Production LLMs must balance modeling quality with predictable latency, stable accelerator utilization, and cost-efficient scaling—constraints that remain difficult for existing architectures. Transformers provide strong reasoning but incur quadratic complexity, while state-space models (SSMs) scale efficiently yet lack fine-grained interactions; prior hybrids either introduce sequential bottlenecks or rely on learned routing that complicates deployment. We present FlowHN, a deployment-oriented parallel hybrid architecture that enables deterministic conditional computation via FLOP-aware token circulation across attention and SSM branches. Instead of dynamic expert routing, FlowHN performs hardware-aligned token scheduling that balances workloads, reduces synchronization stalls, and preserves full parameter utilization. Across 135M–1B models, FlowHN achieves up to 4× higher throughput and 15% higher MFU than strong Transformer, SSM, and hybrid baselines while maintaining competitive accuracy on reasoning, coding, and long-context tasks up to 32K tokens. FlowHN is designed to integrate directly into existing Hybrid pipelines without changes to optimizers, training stacks, or inference serving infrastructure, making it practical for real-world deployment.

pdf bib abs

Traditional phishing website detection relies on static heuristics or reference lists, which lag behind rapidly evolving attacks. While recent systems incorporate large language models (LLMs), they are still prompt-based, deterministic pipelines that underutilize reasoning capability.We present MemoPhishAgent (MPA), a memory-augmented multi-modal LLM agent that dynamically orchestrates phishing-specific tools and leverages episodic memories of past reasoning trajectories to guide decisions on recurring and novel threats.On two public datasets, MPA outperforms three state-of-the-art (SOTA) baselines, improving recall by 13.6%.To better reflect realistic, user-facing phishing detection performance, we further evaluate MPA on a benchmark of real-world suspicious URLs actively crawled from five social media platforms, where it improves recall by 20%.Detailed analysis shows episodic memory contributes up to 27% recall gain without introducing additional computational overhead.The ablation study confirms the necessity of the agent-based approach compared to prompt-based baselines and validates the effectiveness of our tool design.Finally, MPA is deployed in production, processing ∼60K targeted high-risk URLs weekly, and achieving 91.44% recall, providing proactive protection for millions of customers.Together, our results show that combining multi-modal reasoning with episodic memory yields robust, adaptable phishing detection in realistic user-exposure settings.Our implementation is available at https://github.com/XuanChen-xc/MemoPhishAgent.git.

pdf bib abs

Adaptive Weighted Proxy Tuning: Efficient Gray-Box Steering for Image Captioning.
Nafew Azim | Fuad Rahman | Nabeel Mohammed

Adapting Large Vision-Language Models (LVLMs) to specialized domains typically demands resource-intensive fine-tuning or access to proprietary parameters (“white-box” access). While decoding-time strategies like Proxy Tuning offer a parameter-efficient alternative, they rely on rigid, static logit arithmetic that fails to account for instance-specific variations in model certainty and domain shift. In this work, we introduce Adaptive Weighted Proxy Tuning (AWPT), a gray-box steering framework that dynamically modulates the logit contributions of a large base model, a fine-tuned expert, and an untuned anti-expert. Unlike static approaches, AWPT introduces two instance-aware mechanisms: (1) a lightweight ViT-based Weight Predictor that performs amortized inference to estimate optimal mixing coefficients in real-time with negligible added latency (∼0.03s overhead), and (2) a Per-Sample Optimization objective that establishes theoretical performance bounds via gradient-based logit steering. Extensive evaluation across medical (ROCOv2, IU-Xray) and general domains (Flickr30k, MS COCO, TextCaps) demonstrates that AWPT achieves performance parity with fully fine-tuned models while remaining parameter-free regarding the generator. Crucially, our dynamic weighting acts as an effective regularizer, significantly reducing object hallucinations and establishing AWPT as a robust solution for deploying general-purpose LVLMs in safety-critical contexts.

pdf bib abs

NL2SQL systems deployed in industry settings often encounter ambiguous or unanswerable queries, particularly in interactive scenarios with incomplete user clarification. Existing benchmarks typically assume a single source of ambiguity and rely on user interaction for resolution, overlooking realistic failure modes.We introduce Clarity, a framework for automatically generating an NL2SQL benchmark with multi-faceted ambiguities and diverse user behaviors across both single- and multi-turn settings. Using a constraint-driven pipeline, Clarity transforms executable SQL into ambiguous queries, augmented with grounded conversational continuations and schema-level metadata.Empirical evaluation on Spider and BIRD shows that leading NL2SQL systems, including those based on strong LLMs, suffer significant performance degradation under multi-faceted ambiguity. While these systems often detect ambiguity, they struggle to accurately localize and resolve the underlying schema-level sources. Our results highlight the need for more robust ambiguity detection and resolution in industry-grade NL2SQL systems.

pdf bib abs

Interactive large language model (LLM) agents operating via multi-turn dialogue and multi-step tool calling are increasingly used in production. Benchmarks for these agents must both reliably compare models and yield on-policy training data. Prior agentic benchmarks (e.g., tau-bench, tau^2-bench, AppWorld) rely on fully deterministic backends, which are costly to build and iterate. We propose Proxy State-Based Evaluation, an LLM-driven simulation framework that preserves final state-based evaluation without a deterministic database. Specifically, a scenario specifies the user goal, user/system facts, expected final state, and expected agent behavior, and an LLM state tracker infers a structured proxy state from the full interaction trace. LLM judges then verify goal completion and detect tool/user hallucinations against scenario constraints. Empirically, our benchmark produces stable, model-differentiating rankings across families and inference-time reasoning efforts, and its on- and off-policy rollouts provide supervision that transfers to unseen scenarios. Careful scenario specification yields near-zero simulator hallucination rates, as supported by ablation studies. The framework also supports sensitivity analyses over user personas. Human-LLM judge agreement exceeds 90%, indicating reliable automated evaluation. Overall, proxy state-based evaluation offers a practical, scalable alternative to deterministic agentic benchmarks for industrial LLM agents.

pdf bib abs

Large language models (LLMs) are increasingly being deployed in cost- and latency-sensitive settings. While chain-of-thought improves reasoning, it can waste tokens on simple requests. We study selective thinking for tool-using LLMs and introduce Adaptive Rejection Sampling (Ada-RS), an algorithm-agnostic sample filtering framework for learning selective and efficient reasoning. For each given context, Ada-RS scores multiple sampled completions with an adaptive length-penalized reward then applies stochastic rejection sampling to retain only high-reward candidates (or preference pairs) for downstream optimization. We demonstrate how Ada-RS plugs into both preference pair (e.g. DPO) or grouped policy optimization strategies (e.g. DAPO). Using Qwen3-8B with LoRA on a synthetic tool call-oriented e-commerce benchmark, Ada-RS improves the accuracy-efficiency frontier over standard algorithms by reducing average output tokens by up to ∼80% and reducing thinking rate by up to ∼95% while maintaining or improving tool call accuracy. We further demonstrate that these gains generalize across model scales (Qwen3-1.7B, 8B, 14B) and domains (τ 2-Bench airline and telecom). These results highlight that training signal selection is a powerful lever for efficient reasoning in latency-sensitive deployments.

pdf bib abs

Multimodal large language models (MLLMs) are effective at capturing the semantics of short video content; however, they often fail to attend to the policy-specific details required for reliable content moderation.To address this limitation, we introduce IPS, a novel framework that integrates In-prompt Process Supervision into MLLMs by introducing sequential reasoning over ancillary questions during fine-tuning. IPS consistently outperforms baseline MLLMs on public and proprietary benchmarks.Moreover, replacing human-annotated ancillary labels with MLLM-generated ones results in only marginal performance degradation, demonstrating robustness to noisy supervision and strong scalability with model-generated annotations.These findings establish IPS as a scalable and effective solution for complex multimodal classification in large-scale industrial settings.

pdf bib abs

As a primary medium for human interaction and information exchange, social networking services (SNS) present distinct challenges for large language models (LLMs): rapidly evolving norms and slang, and culturally diverse content that causes knowledge distribution shift. While supervised fine-tuning (SFT) can improve in-domain performance, it often induces a ”seesaw” trade-off with out-of-domain robustness, especially for smaller models. To address these challenges, we present RedOne 2.0, an SNS-oriented LLM developed with a progressive, RL-prioritized post-training paradigm for fast and stable adaptation. Our pipeline has three stages: (1) Exploratory Learning on curated SNS corpora to establish initial alignment and surface systematic weaknesses; (2) Targeted Fine-Tuning that applies SFT only to diagnosed gaps while mixing a small amount of general data to reduce forgetting; and (3) Refinement Learning that re-applies RL with SNS-centric signals to consolidate gains and balance trade-offs across tasks. Across various tasks in three categories, our 4B model improves by 2.41 on average over the prior 7B RedOne baseline. It also yields an 8.74 average gain over its Qwen3-4B base while using less than half the data required by the SFT-centric method, demonstrating superior data efficiency and stability at compact scales. Overall, RedOne 2.0 provides a competitive, cost-effective baseline for SNS-specific LLMs, improving capability without sacrificing robustness.

pdf bib abs

Traditional industrial agents rely on modular pipelines, including Router, Retriever, Planner, Executor, Responder, Reviewer and so on, which inevitably fracture into a labyrinth of ad-hoc patches, leading to cascading errors and high latency. We propose OneModel, an applicable paradigm shift from external workflows to internalized knowledge representation. Unlike modular systems that slice fluid user intents into static steps, OneModel consolidates complex business logic and SOPs directly into the model’s parameters.Through Continual Pre-training (CPT) and logic-compilation SFT, we transform fragmented business rules into the model’s intuitive reasoning within a unified attention space. Deployed in our global financial service system, OneModel effectively breaks the impossible triangle of latency, accuracy, and complexity. Online A/B testing demonstrates end-to-end latency reduction of more than 50% (18.7s → 8s) while the Intelligent Resolution Rate (IRR) jumps from 64.3% to 83.3%. The results demonstrate our paradigm OneModel effectively replaces brittle engineering logic with internalized cognitive intuition, offering a scalable and future-proof blueprint for transitioning industrial agents from complex, error-prone workflow to unified model architectures.

pdf bib abs

FinRAG-12B: A Production-Validated Recipe for Grounded Question Answering in Banking
Denys Katerenchuk | Pablo Duboue | Keelan Evanini

Large language models (LLMs) are rapidly being adopted across various domains. However, their adoption in banking industry faces resistance due to demands for high accuracy, regulatory compliance, and the need for verifiable and grounded responses. We present a unified, data-efficient framework for training grounded domain-specific LLMs that optimizes answer quality, citation grounding, and calibrated refusal under real-world deployment constraints. First, we describe a data generation pipeline that combines LLM-as-a-Judge filtering, citation annotation, and curriculum learning with only 143M tokens. The resulting 12B model achieves high answer quality outperforming GPT-4.1 on citation grounding, with a modest citation tradeoff versus the untuned base. Second, we propose a calibrated refusal mechanism: training on 22% unanswerable examples yield a 12% “I don’t know” rate, substantially improving over the base model’s unsafe 4.3% rate while avoiding GPT-4.1’s over-refusal (20.2%). Third, we present an end-to-end methodology spanning from data curation to quantized serving. The system is deployed at 40+ financial institutions, achieving a 7.1percentage point improvement in query resolution (p < 0.001). Additionally, the model delivers 3–5x faster responses at 20–50x lower cost compared to GPT-4.1.

pdf bib abs

CascadeDebate: Multi-Agent Deliberation for Cost-Aware LLM Cascades
Raeyoung Chang | Dongwook Kwon | Jisoo Lee | Nikhil Verma

Cascaded LLM systems coordinate models of varying sizes with human experts to balance accuracy, cost, and abstention under uncertainty. However, single-model tiers at each stage falter on ambiguous queries, triggering premature escalations to costlier models or experts due to under-confidence and inefficient compute scaling. **CascadeDebate** addresses this critical gap by inserting multi-agent deliberation directly at each tier’s escalation boundary. Confidence-based routers activate lightweight agent ensembles only for uncertain cases, enabling consensus-driven resolution of ambiguities internally, without invoking higher-cost upgrades. Our unified architecture alternates single-model inference with selective multi-agent deliberation across model scales, culminating in human experts as final fallback. This design scales test-time compute dynamically to query difficulty. Across five benchmarks spanning science, medicine, and general knowledge, CascadeDebate outperforms strong single-model cascades and standalone multi-agent systems by up to 26.75%.An online threshold optimizer proves essential, boosting accuracy 20.98–52.33% relative improvement over fixed policies and enabling elastic adaptation to real-world distributions.

pdf bib abs

Recent advances in LLMs have accelerated both information generation and misinformation, especially in low-resource languages like Vietnamese, motivating robust fact-checking systems. Existing methods struggle with semantic ambiguity, homonyms, and complex linguistic structures, often trading accuracy for efficiency. We introduce SemViQA, a novel Vietnamese fact-checking framework integrating Semantic-based Evidence Retrieval (SER) and Two-step Verdict Classification (TVC). Our approach balances precision and speed, achieving state-of-the-art results with 78.97% strict accuracy on ISE-DSC01 and 80.82% on ViWikiFC, securing 1st place in the UIT Data Science Challenge. Additionally, SemViQA Faster improves inference speed 7× while maintaining competitive accuracy. SemViQA sets a new benchmark for Vietnamese fact verification, advancing the fight against misinformation.

pdf bib abs

Task-oriented proactive dialogue agents play a pivotal role in recruitment, particularly for steering conversations towards specific business outcomes, such as acquiring social-media contacts for private-channel conversion. Although supervised fine-tuning and reinforcement learning have proven effective for training such agents, their performance is heavily constrained by the scarcity of high-quality, goal-oriented domain-specific training data. To address this challenge, we propose SimRPD, a three-stage framework for training recruitment proactive dialogue agents. First, we develop a high-fidelity user simulator to synthesize large-scale conversational data through multi-turn online dialogue. Then we introduce a multi-dimensional evaluation framework based on Chain-of-Intention (CoI) to comprehensively assess the simulator and effectively select high-quality data, incorporating both global-level and instance-level metrics. Finally, we train the recruitment proactive dialogue agent on the selected dataset. Experiments in a real-world recruitment scenario demonstrate that SimRPD outperforms existing simulator-based data selection strategies, highlighting its practical value for industrial deployment and its potential applicability to other business-oriented dialogue scenarios.

pdf bib abs

ProductResearch: Training E-Commerce Deep Research Agents via Multi-Agent Synthetic Trajectory Distillation
Jiangyuan Wang | Kejun Xiao | Huaipeng Zhao | Tao Luo | Xiaoyi Zeng

Large Language Model (LLM)-based agents show promise for e-commerce conversational shopping, yet existing implementations lack the interaction depth and contextual breadth required for complex product research. Meanwhile, the Deep Research paradigm, despite advancing information synthesis in web search, suffers from domain gaps when transferred to e-commerce. We propose ProductResearch, a multi-agent framework that synthesizes high-fidelity, long-horizon tool-use trajectories for training robust e-commerce shopping agents. The framework employs a User Agent to infer nuanced shopping intents from behavioral histories, and a Supervisor Agent that orchestrates iterative collaboration with a Research Agent to generate synthetic trajectories culminating in comprehensive, insightful product research reports. These trajectories are rigorously filtered and distilled through a reflective internalization process that consolidates multi-agent supervisory interactions into coherent single-role training examples, enabling effective fine-tuning of LLM agents for complex shopping inquiries. Extensive experiments show that a compact MoE model fine-tuned on our synthetic data achieves substantial improvements over its base model in response comprehensiveness, research depth, and user-perceived utility, approaching the performance of frontier proprietary deep research systems and establishing multi-agent synthetic trajectory training as an effective and scalable paradigm for enhancing LLM-based shopping assistance.

pdf bib abs

Related search query recommendation is essential for enhancing user engagement and information discovery on digital platforms. While Large Language Models (LLMs) have shifted the field toward generative retrieval, existing methods suffer from two primary limitations: (1) pointwise generation via beam search often leads to semantic redundancy and wasted retrieval quota, and (2) current listwise approaches lack explicit reasoning, relying on superficial click-through rate (CTR) rewards. In this paper, we propose ReList, a novel framework that transforms related search into a reasoning-enhanced listwise generation task. ReList follows a two-stage training paradigm: first, Reasoning Activation constructs a high-quality dataset by back-translating diverse query lists into Chain-of-Thought (CoT) rationales; second, Alternative Training iteratively evolves the model using Reinforcement Learning with a Gated Multi-Objective Reward and a Corrective SFT mechanism to handle hard samples. Experimental results on real-world search benchmarks and online A/B tests demonstrate that ReList significantly outperforms state-of-the-art methods in both query diversity and user engagement, providing more insightful and logically grounded query recommendations.

pdf bib abs

This paper presents a novel extension of neural scaling laws to Mixture-of-Experts (MoE) models, focusing on the optimal allocation of compute between expert and attention sub-layers. As MoE architectures have emerged as an efficient method for scaling model capacity without proportionally increasing computation, determining the optimal expert-attention compute ratio becomes critical. We define the ratio r as the fraction of total FLOPs per token dedicated to the expert layers versus the attention layers, and explore how this ratio interacts with the overall compute budget and model sparsity. Through extensive experiments with GPT-style MoE Transformers, we empirically find that the optimal ratio r\* follows a power-law relationship with total compute and varies with sparsity. Our analysis leads to an explicit formula for r\*, enabling precise control over the expert-attention compute allocation. We generalize the Chinchilla scaling law by incorporating this architectural parameter, providing a new framework for tuning MoE models beyond size and data. Our findings offer practical guidelines for designing efficient MoE models, optimizing performance while respecting fixed compute budgets.

pdf bib abs

A Multistage Extraction Pipeline for Long Scanned Financial Documents: An Empirical Study in Industrial KYC Workflows
Han Yuxuan | Yuanxing Zhang | Yushuo Wang | Yichao Jin

Structured information extraction from long, multilingual scanned financial documents is a core requirement in industrial KYC and compliance workflows. These documents are typically non-machine-readable, noisy, and visually heterogeneous. They usually span dozens of pages while containing only sparse task-relevant information. Although recent vision–language models (VLMs) achieve strong benchmark performance, directly applying them end-to-end to full financial reports often leads to unreliable extraction under real-world conditions.We present a multistage extraction framework that integrates image preprocessing, multilingual OCR, hybrid page-level retrieval, and compact VLM-based structured extraction. The design separates page localization from multimodal reasoning, enabling more accurate extraction from complex multi-page documents.We evaluated the framework on 120 production KYC documents comprising about 3000 multilingual scanned pages. Across multiple OCR–VLM combinations, the proposed pipeline consistently outperforms direct PDF-to-VLM baselines, improving field-level accuracy by up to 31.9 percentage points. The best configuration, PaddleOCR with MiniCPM-o-2.6, achieves 87.27% accuracy. Ablation studies show that page-level retrieval is the dominant factor in performance improvements, particularly for complex financial statements and non-English documents.

pdf bib abs

Earnings calls are a key source of financial information about public companies. However, extracting information from these calls is difficult.Unlike the templatic filings required by the U.S. Securities and Exchange Commission (SEC) to report a company’s financial situation, earnings conference calls have no built-in labels, are unstructured, and feature conversational language.We explore this challenging domain by assessing the information captured by models trained on SEC filings and in-context learning methods. To establish a baseline, we first evaluate the generalization capabilities of SEC-trained models across established SEC datasets.To support our investigation, we introduce three novel benchmarks: (1) SEC Filings Benchmark (SECB), (2) Earnings Calls Benchmark (ECB), and ECB-A, a subset with 5,346 expert annotations to support our qualitative analysis.We find that encoder-based models struggle with the domain shift. Finally, we propose a system utilizing LLMs to perform open-ended extraction from unstructured call transcripts, verified by human evaluation (79.7% precision), providing a baseline for this valuable domain through the consistent tracking of emergent KPIs.

pdf bib abs

Structure-Guided Entity Resolution: Fine-Tuning LLMs for Robust Name Matching in Complex Linguistic Contexts
Shivam Chourasia | Hitesh Kapoor | Nilesh Patil

Matching person names across heterogeneous records is a core challenge in entity resolution, especially within linguistically and culturally complex environments. Variations in naming conventions, inconsistent transliteration across scripts, and frequent data entry errors make it difficult to unify user identities, an essential requirement for Know Your Customer (KYC) compliance. While Large Language Models have shown promise in understanding natural language, they often struggle with the structured ambiguity present in such domain-specific settings. This paper introduces Structure-Guided Entity Resolution (SGER), a novel framework that fine-tunes an LLM through a two-phase curriculum. The model is first trained to parse the grammatical and semantic structure of personal names, then optimized for the downstream task of binary entity matching. We evaluate SGER in the challenging context of Indian identity data, one of the most linguistically diverse and noisy environments globally. SGER achieves 99.02% accuracy and an F1 of 0.994 on a held-out set of 50,000 real-world pairs, outperforming GPT-4o few-shot prompting and single-stage fine-tuning baselines. The system is fully deployed in production at Dream11, the world’s largest fantasy sports platform, serving 250M+ users. Our results demonstrate that curriculum-guided training enables robust, high-precision entity resolution in real-world multilingual systems at scale.

pdf bib abs

Accurate Legal Reasoning at Scale: Neuro-Symbolic Offloading and Structural Auditability for Robust Legal Adjudication
Stanisław Sójka | Witold Kowalczyk

Legal texts often contain computational legal clauses—provisions whose understanding requires complex logic. While frontier Large Reasoning Models (LRMs) can describe such clauses, building production-ready systems is limited by reasoning errors and the high cost of inference. We propose Amortized Intelligence, a neuro-symbolic approach where we use an LLM once to translate a legal text into Deterministic Autonomous Contract Language (DACL): a typed graph intermediate representation. Adjudication then relies on deterministic graph executions with a visually auditable trace. In comparison against runtime LRM baselines (including GPT-5.2 and Gemini 3 Pro), our DACL-based Agent achieves near-perfect consistency and mitigates the "reasoning cliff" observed in probabilistic models. The system reduces compute costs by over 90% in high-volume workflows while satisfying the strict auditability requirements of legal adjudication.

pdf bib abs

Enterprise LLM agents can dramatically improve workplace productivity, but their core capability, retrieving and using internal context to act on a user’s behalf, also creates new risks for sensitive information leakage. We introduce **CI-Work**, a Contextual Integrity (CI)-grounded benchmark that simulates enterprise workflows across five information-flow directions and evaluates whether agents can convey *essential* content while withholding *sensitive* context in dense retrieval settings.Our evaluation of frontier models reveals that privacy failures are prevalent (violation rates range from 15.8%-50.9%, with leakage reaching up to 26.7%) and uncovers a counterintuitive trade-off critical for industrial deployment: higher task utility often correlates with increased privacy violations.Moreover, the massive scale of enterprise data and potential user behavior further amplify this vulnerability. Simply increasing model size or reasoning depth fails to address the problem. We conclude that safeguarding enterprise workflows requires a paradigm shift, moving beyond model-centric scaling toward context-centric architectures.

pdf bib abs

From Pixels to Policies: Reinforcing Spatial Reasoning in Language Models for Content-Aware Layout Design
Sha Li | Stefano Petrangeli | Yu Shen | Xiang Chen

We introduce LaySPA, a reinforcement learning framework that equips large language models (LLMs) with explicit and interpretable spatial reasoning for content-aware graphic layout design. LaySPA addresses two key challenges: LLMs’ limited spatial reasoning and the lack of transparency in design decision making. Instead of operating at the pixel level, we reformulate layout design as a policy learning problem over a structured textual spatial environment that explicitly encodes canvas geometry, element attributes, and inter-element relationships. LaySPA produces dual-level outputs comprising interpretable reasoning traces and structured layout specifications, enabling transparent and controllable design decision making. Layout design policy is optimized via a multi-objective spatial critique that decomposes layout quality into geometric validity, relational coherence, and aesthetic consistency, and is trained using relative group optimization to stabilize learning in open-ended design spaces. Experiments demonstrate that LaySPA improves structural validity and visual quality, outperforming larger proprietary LLMs and achieving performance comparable to specialized state-of-the-art layout generators while requiring fewer annotated samples.

pdf bib abs

Practical Guidelines for Model Merging in LLMs Pre-Training
Giuseppe Curci | Stefano Simonazzi | Andrea molinari | Andrea Zugarini

Model merging is widely used to combine fine-tuned models trained with different data distributions, tasks, or hyperparameters, yet its role during LLM pre-training remains underexplored. We systematically study checkpoint merging across training phases, focusing on the transition from stable to decaying learning rates. Across multiple scales, we find that simple averaging methods consistently improve performance during stable learning rate regimes, but gains sharply diminish during decay. We link this effect to reduced checkpoint diversity and show that merging effectiveness correlates with parameter-space variation. Strategies such as synthetic variability, task-vector merging, and cross-run merging yield only modest improvements. Our results provide practical insights on when merging is most effective in large-scale pre-training.

pdf bib abs

Generative AI—powered by Large Language Models (LLMs)—is increasingly deployed in industry across healthcare decision support, financial analytics, enterprise retrieval, and conversational automation, where reliability, efficiency, and cost control are critical. In such settings, models must satisfy strict constraints on energy, latency, and hardware utilization—not accuracy alone. Yet prevailing evaluation pipelines remain accuracy-centric, creating a Deployment–Evaluation Gap—the absence of operational and economic criteria in model assessment. To address this gap, we present EDGE-EVAL—a industry-oriented benchmarking framework that evaluates LLMs across their full lifecycle on legacy NVIDIA Tesla T4 GPUs. Benchmarking LLaMA and Qwen variants across three industrial tasks, we introduce five deployment metrics—Economic Break-Even (Nbreak), Intelligence-Per-Watt (IP W ), System Density (ρsys), Cold-Start Tax (Ctax), and Quantization Fidelity (Qret)—capturing profitability, energy efficiency, hardware scaling, serverless feasibility, and compression safety. Our results reveal a clear efficiency frontier—models in the < 2B parameter class dominate larger baselines across economic and ecological dimensions. LLaMA-3.2-1B (INT4) achieves ROI break-even in 14 requests (median), delivers 3× higher energy-normalized intelligence than 7B models, and exceeds 6,900 tokens/s/GB under 4-bit quantization. We further uncover an efficiency anomaly—while QLoRA reduces memory footprint, it increases adaptation energy by up to 7× for small models—challenging prevailing assumptions about quantization-aware training in edge deployment.

pdf bib abs

Large-scale introductory CS courses, often enrolling thousands of students, struggle to provide personalized support and encourage active participation. While recent large language models (LLMs) have enabled AI teaching assistants at scale, most existing systems remain reactive, responding only after students explicitly initiate queries. We present SCALA, a student-centered AI learning assistant designed to provide proactive support for students. SCALA introduces predictive query management, a mechanism that generates likely student questions and answers ahead of lectures. Students may choose to view these pre-generated question–answer pairs or engage in interactive conversations with our tutoring model via the same interface. We evaluate SCALA through a semester-long deployment in an undergraduate Python course with over 1,500 students, and find that predictive queries are frequently selected in practice and substantially overlap with real student questions. Based on student feedback, learners preferred SCALA’s responses to their real queries over alternatives such as GPT-4o. These results suggest proactive support as a promising direction for future development of AI-powered teaching assistants. We will release our codebase and interactive demo upon acceptance.

pdf bib abs

Retail banks handle high volumes of customer interactions across different channels that span various topics. Early and accurate detection of the intent of the customer is critical towards streamlining contact-center operations through efficient routing and handling of conversations. Mining of customer interactions leads to identification of friction points in customer journeys and offers valuable insights about customer needs. Existing approaches to define customer intents or contact reasons remain fragmented, manually maintained across organizations and relying on knowledge of specific business processes. We propose a framework that develops a dynamic hierarchical Reason-of-Contact (RoC) taxonomy to cover customer topics across hundreds of business processes. We further demonstrate the implementation of this taxonomy to a robust solution that identifies intents for all customer conversations across different channels. Our deployed system supports real time use with a 150 to 300 ms turnaround per conversation. It achieves up to 10% improvement in F1 score over baseline approaches on a reference dataset. We also detail deployment considerations, including dynamic taxonomy updates, out-of-domain detection, and auditability. Finally, we present ablations and error analyses to characterize effectiveness.

pdf bib abs

Large language models (LLMs) can answer religious knowledge queries fluently, yet they often hallucinate and misattribute sources, which is especially consequential in Islamic settings where users expect grounding in canonical texts (Qur’an and Hadith) and jurisprudential (fiqh) nuance. Retrieval-augmented generation (RAG) improves grounding, however, a single retrieve-then-generate pipeline is insufficient for diverse Islamic queries, including verbatim scripture, citation-grounded guidance, and rule-constrained computations such as zakat and inheritance. To address these challenges, we present Fanar-Sadiq, a bilingual Arabic-English Islamic QA system built on a multi-agent, tool-augmented architecture. It is a core component of the Fanar AI platform. Fanar-Sadiq routes Islamic queries to specialized modules within an agentic tool architecture. It supports intent-aware routing, retrieval-grounded fiqh answers with normalized citations and verification traces, exact verse lookup with quotation validation, and deterministic Sunni zakat and inheritance calculators with madhhab-sensitive branching. We evaluate the end-to-end system on public Islamic QA benchmarks and show strong effectiveness and efficiency. It is publicly accessible through an API and Web application and has received over 1.9M accesses in less than a year (https://api.fanar.qa/docs).

pdf bib abs

Reviewing medical records for clinical and insurance decisions must handle long, heterogeneous documents while producing consistent, traceable, guideline-compliant outcomes under strict latency and cost constraints. We propose GuideTree, which compiles textual guidelines into a fixed review tree of evidence-grounded verification primitives. GuideTree uses short per-document summaries only for routing each check to a minimal set of document types and candidates; final verification always reads full document text and returns structured evidence. The tree is induced offline via a cost-aware split-and-prune search and updated safely through regression-tested, versioned patches. Across 1,000 cases from four industrial review scenarios and four LLM backbones, GuideTree achieves 84.5–92.8 Macro-F1, outperforming the strongest non-expert baselines by 3.3–7.6 points and matching ExpertTree within 0.2–0.6 points (avg. 0.38). On chronic disease with Qwen3-235B-A22B-Instruct, GuideTree reduces average I/O volume to 74K input+output characters (-82% vs. long-context prompting) and average latency to 22s (-83% vs. long-context prompting), while reaching 99% decision consistency over K=5 reruns.

pdf bib abs

While prompt engineering offers effective control over Text-to-Image (T2I) generation, it remains labor-intensive for large-scale production. We present PRISM-DUEL, a black-box framework that formalizes prompt optimization as Automatic Prompt Engineering (APE), motivated by advertising workflows requiring low-latency, diverse variants faithful to a human-designed ads. Since zero-shot LLMs are unreliable judges of image quality, PRISM-DUEL obtains label-free pairwise preferences and rationales from an LLM judge over pairs of generated images, then uses a dueling-bandit optimizer to optimize a prompt for generating controlled variations while matching the reference ad’s visual content. By iteratively steering the prompt distribution towards higher-quality generations and improving posterior calibration, PRISM-DUEL preserves visual similarity and semantic faithfulness while increasing diversity. Experiments on PartiPrompts and DreamBooth across Gemini 2.5 Flash Image, FLUX.1, and Qwen-Image show consistent gains over strong baselines in visual faithfulness and prompt interpretability.

pdf bib abs

Efficient Agent Evaluation via Diversity-Guided User Simulation
Itay Nakash | George Kour | Ateret Anaby Tavor

Large language models (LLMs) are increasingly deployed as customer-facing agents, yet evaluating their reliability remains challenging due to stochastic, multi-turn interactions. Current evaluation protocols rely on linear Monte Carlo rollouts of full agent-user conversations to estimate success. This approach is computationally inefficient - reprocessing identical conversation prefixes across runs, and often fails to uncover deep failure modes triggered by rare user behaviors.We introduce DIVERT (Diversity-Induced Evaluation via Branching of Trajectories), a snapshot-based, coverage-guided user simulation framework for efficient and systematic exploration of multi-turn agent behavior. DIVERT captures the full agent–environment state at critical junctions and resumes execution from these points, reusing shared prefixes to avoid redundant regeneration and reduce token cost. From each junction, it branches with targeted, diverse user responses, enabling directed exploration of alternative interaction paths while preserving task intent.By reallocating computation from redundant restarts to behaviorally salient mid-trajectory states, DIVERT steers evaluation toward under-explored semantic regions and rare interaction failures. Experiments on realistic multi-domain benchmarks show that our method consistently improves failure discovery efficiency and task-level coverage compared to standard linear rollout evaluation, without increasing overall cost.

pdf bib abs

What Question Did You Answer? Refining Contact Center Evaluation Plans via Backward Questions
Prajwal Sood | Rushikesh Pawar | Digvijay Anil Ingle | Anup Pattnaik

Capturing organization-specific domain knowledge remains a critical challenge for deploying cost-efficient language models in specialized tasks like contact center Quality Assurance (QA). While large LMs implicitly capture expert judgment, smaller LMs require explicit evaluation criteria that domain experts struggle to articulate. We introduce Backward Question-based Refinement (BQR), a diagnostic framework that generates backward questions, revealing what a model understood rather than what was asked, to systematically distill implicit reasoning from large LMs into explicit evaluation plans. Through experiments on 12 QA questions, BQR achieves performance improvements on 8 questions with absolute gains of up to 27.8% in Macro F1. Our analysis establishes empirical parallels to gradient-descent optimization and reveals a cross-family advantage where small LMs benefit more from large LMs of different families. These findings confirm BQR as an effective approach for bridging the gap between implicit expert knowledge and explicit evaluation criteria.

pdf bib abs

Pricing automation in large-scale tourism is challenging because travel orders are highly unstructured, while pricing policies are complex, rapidly evolving, and inherently open-ended. Traditional rule engines are brittle and costly to maintain, whereas unconstrained LLM agents lack the reliability and auditability required for financial decisions. We present a production-grade LLM-powered pricing system with a strict decision boundary: LLMs perform structured extraction and bounded policy/path selection, while all numeric pricing, including total-price computation, is executed deterministically. Policies are compiled into interpretable condition trees, enabling open-ended support for new clauses and evolving rules without code changes, while exposing auditable artifacts for human-in-the-loop control. Periodic fine-tuning on logged traces further improves tree induction and path matching. Deployed at a municipal state-owned tourism enterprise across 7 scenic sites and 12 business categories with 1,500+ operators and 1,000+ active policies, the system processed 3,960 orders in six months, reduced the order management team from 15-20 to 3, and cut per-order handling time from 10 minutes to <2 minutes.

pdf bib abs

Diagnose, Then Repair: A Two-Stage MQM-Guided Post-Editing Framework for Domain-Specific Machine Translation
Ji Hun Wang | Siyu Wu

LLM-based machine translation evaluation can closely match human judgments, but in practice it remains largely diagnostic, with the signals rarely translating into direct quality improvements under real production constraints. We propose a two-stage, evaluator-guided automatic post-editing framework that turns MQM-style evaluation into targeted repairs: a retrieval-augmented LLM evaluator outputs structured, span-level MQM diagnoses under an explicit edit contract, and a separate LLM post-editor applies minimal edits restricted to those diagnoses. This separation improves controllability and reduces paraphrastic drift compared to one-stage "judge-and-refine” baselines. In a systematic study involving seven LLMs spanning three model providers and seven languages, our best configuration consistently improves both reference-based COMET and CometKiwi scores over one-stage post-edit methods, while the evaluator’s error spans and severities show strong agreement with human MQM annotations and human editor preferences.

pdf bib abs

Enterprise deep research often fails to produce decision-ready reports due to uneven information coverage, context explosion, and premature stopping. We propose a scalable Enterprise Deep Research (EDR) architecture to address these failures. Our system (i) decomposes requests into coverage-driven objectives via outline generation with reflection, (ii) localizes context with dependency-guided execution and explicit information sharing, and (iii) enforces evidence-based completion criteria so agents iteratively collect information until sufficiency conditions are met. We evaluate on an internal sales enablement task and the public DeepResearch Bench benchmark, where our proposed system design achieves the strongest overall performance compared with competitive deep-research baselines. The results show that dependency-controlled context and explicit evidence sufficiency criteria reduce premature stopping and improve the consistency and depth of enterprise research outputs.

pdf bib abs

Financial Large Language Models (LLMs) exhibit strong domain expertise but remain vulnerable to financially harmful prompts. To systematically assess this vulnerability, we introduce FinHarmBench, a benchmark designed to evaluate financially harmful and confusable benign prompts. Our analysis reveals a concerning result that financial LLMs can be less robust than general-purpose models, suggesting that domain adaptation alone does not guarantee financial safety alignment. To address this issue, we propose Financial Refusal Steering Distillation (FiRSD), an unsupervised training framework that strengthens financial-domain safety by learning and distilling a financial refusal direction at the representation level. FiRSD enhances refusal behavior without requiring annotated refusal responses. Experiments show that FiRSD substantially improves safety while largely preserving task capability. These results highlight the importance of domain-aware safety alignment for high-stakes financial applications.

pdf bib abs

Multi-hop question answering (MHQA) is a practical bottleneck in industry applications such as enterprise assistants, customer-support copilots, and compliance analysis, where systems must combine evidence across multiple documents before answering. Large language models (LLMs) remain brittle in this setting: iterative retrieval can commit too early to low-recall trajectories, while planning-only approaches can produce static query sets that fail to adapt when intermediate evidence changes. We propose Planned Active Retrieval and Reasoning RAG (PAR²-RAG), a training-free two-stage framework that separates coverage from commitment. PAR²-RAG first performs breadth-first anchoring to build a high-recall evidence frontier, then applies depth-first refinement with evidence sufficiency control in an iterative loop. This design targets deployment constraints by avoiding retraining cycles, reducing maintenance overhead under changing corpora, and improving scalability across domains. Across four MHQA benchmarks, PAR²-RAG consistently outperforms strong baselines: compared with IRCoT, it achieves up to 23.5% higher answer accuracy and up to 10.5% NDCG gains in retrieval quality.

pdf bib abs

OmniOData: Unleashing Small Language Models for OData Query Generation with Synthetic Data and Reinforcement Learning
Tao Bai | Zhaochen Li | Hongxin Shao | Daniel Dahlmeier

Despite the success of Large Language Models (LLMs) in structured query generation, OData—a critical RESTful protocol for enterprise APIs—remains under-researched due to a lack of high-fidelity, execution-validated datasets. To bridge this gap, we introduce OmniOData, a framework that generates SynOData, the first large-scale OData corpus featuring execution-grounded queries and reasoning traces. Using this corpus, we develop OmniOData-R1 (1.5B–3B parameters), a family of models that match or surpass frontier proprietary systems, such as GPT-4o and Gemini 3, on realistic industrial benchmarks. Our results demonstrate that the synergy of execution-verified synthetic data and Reinforcement Learning (RL) effectively unlocks the latent reasoning of Small Language Models (SLMs), providing a high-performance, low-latency solution for specialized enterprise query generation.The code and data will be released under an open-source license.

pdf bib abs

Small language models (SLMs) are promising for real-world deployment due to their efficiency and low operational cost. However, their limited capacity struggles with high-stakes legal reasoning tasks that require coherent statute interpretation and logically consistent deduction. Furthermore, training SLMs for such tasks demands high-quality, concise reasoning trajectories, which are prohibitively expensive to manually collect and difficult to curate via standard rejection sampling, which lacks granularity beyond final verdicts. To address these challenges, we propose LegalDrill, a diagnosis-driven synthesis framework that extracts and iteratively refines reasoning trajectories from a capable teacher via fine-grained prompting, then a self-reflective verification is employed to adaptively select the most effective data for the SLM student. The resulting data empower SLM training through supervised fine-tuning and direct preference optimization. Extensive experiments on several legal benchmarks demonstrate that LegalDrill significantly bolsters the legal reasoning capabilities of representative SLMs while bypassing the need for scarce expert annotations, paving a scalable path toward practical legal reasoning systems.

pdf bib abs

From TextBlob to LLM Agents: Sentiment Model Selection for B2B Technical Support with CSAT Ground Truth
Pedro Vidigal

We present a five-year case study of sentiment model selection for customer satisfaction (CSAT) prediction in B2B technical support. Our evaluation uses the complete population of CSAT-rated tickets from an enterprise software company: over 500 tickets comprising ∼2,500 customer comments from 100+ organizations over five years. We evaluate 17 approaches across 5 paradigms (lexicon, off-the-shelf transformers, NLI zero-shot, multi-task LLM agent, and 12 dedicated LLM agents from 6 vendor families), plus 11 fine-tuning experiments (all achieving MCC≤0). Key findings: (1) a dedicated single-task LLM agent reduces neutral bias from 69% to 22%, improving MCC from -0.018 to 0.347 (p<0.001); (2) our results are consistent with the "Alignment Tax" (Lin et al., 2024; Wu et al., 2025) in sentiment classification: Claude Opus 4.6 exhibits 41% neutral predictions and lower recall than its budget model Haiku 4.5 (p=0.003); (3) ∼38% of dissatisfied customers are undetectable by all 12 LLMs due to administrative requests lacking emotional language; (4) Gemini 3 Flash achieves the best MCC (0.347) at 0.60/1K, over 100× cheaper than Claude Opus. We describe the three-phase production deployment and provide practitioner recommendations.

pdf bib abs

The matching paradigm is fundamental to large-scale information retrieval and is widely used in industrial search and advertising systems. Existing approaches employ Large Language Models (LLMs) primarily as feature extractors, underutilizing their full modeling capabilities. To address this limitation, we propose a novel matching paradigm, termed the Unified Generative and Discriminative LLM (UGD). It integrates two-tower, single-tower, and generative tasks within a unified LLM framework via attention-mask partitioning, enabling generative tasks to serve as auxiliary supervision for discriminative learning and facilitating distillation from single-tower to two-tower architectures through a multi-task fine-tuning mechanism. To satisfy online latency constraints, we further introduce a self-distillation variant of UGD with a KMeans-enhanced linearized RQVAE for prompt compression and quantization. This design compresses and quantizes landing-page documents during inference, improving serving efficiency and reducing storage overhead. Extensive experiments show that UGD achieves superior performance and strong practical value. The framework has been deployed in an industrial search engine serving hundreds of millions of users and hundreds of thousands of advertisers, significantly enhancing search experience. Open access upon publication.

pdf bib abs

Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms
Xinlin Wang

Despite the impressive capabilities of large language models, their substantial computational costs, latency, and privacy risks hinder their widespread deployment in real-world applications. Small Language Models (SLMs) with fewer than 10 billion parameters present a promising alternative; however, their inherent limitations in knowledge and reasoning curtail their effectiveness. Existing research primarily focuses on enhancing SLMs through scaling laws or fine-tuning strategies while overlooking the potential of using agent paradigms, such as tool use and multi-agent collaboration, to systematically compensate for the inherent weaknesses of small models. To address this gap, this paper presents the first large-scale, comprehensive study of <10B open-source models under three paradigms: (1) the base model, (2) a single agent equipped with tools, and (3) a routing-based multi-agent system with collaborative capabilities.Our results show that structured agent frameworks (combining step-by-step reasoning and tool use) substantially improve effectiveness over direct prompting, with single-agent systems achieving the best balance between performance and cost. In contrast, routing-based multi-agent setups introduce additional coordination overhead with limited gains under small-model constraints.Our findings highlight the importance of agent-centric design for efficient and trustworthy deployment in resource-constrained settings.

pdf bib abs

FourCorners: A Production Knowledge Graph Unifying Thailand’s Legal System
Pawitsapak Akarajaradwong | Sarana Nutanong | Chompakorn Chaksangchaichot

Jurisdictionally bound domains, such as law, often lack standardized, machine-readable data formats, requiring foundational infrastructure before downstream applications can succeed. We present ThLexGraph, the first unified temporal knowledge graph for Thai legal data, integrating 3,840 laws (6,273 versions) with 87,394 Supreme Court decisions, updated daily. The graph encodes hierarchy, temporal versioning, cross-references, and sequential order, all extracted from unstructured official sources where no structured representation previously existed. A five-setting comparison on NitiBench-Tax isolates data infrastructure as the sole variable: graph-structured retrieval achieves Citation F1 of 0.812 versus 0.666 for practitioner-standard web search and 0.685 for flat vector retrieval, while searching a corpus 53x larger. Trace analysis of 820 agent-issued queries reveals that hierarchy traversal and cross-reference following, capabilities absent from generic retrieval, are exercised in 50% and 16% of questions, respectively. Our system demonstrates that structured modeling of hierarchy, temporal versioning, cross-references, and sequential order can overcome structural limitations of legal data published without standardized formats.

pdf bib abs

Modular Monolingual Adaptation using Pretrained Language Models
Nalin Kumar | Ondrej Dusek

Building monolingual language models (LMs) for low-resource languages typically relies on adapting pretrained language models (PLMs) by finetuning the whole model on the target language. This approach is widely favored over training from scratch, as it enables effective knowledge transfer. Additionally, prior work has shown that using a language-specific tokenizer can enhance the adaptability. In this work, we hypothesize that full model tuning is often unnecessary and propose a more modular approach. Specifically, we replace the tokens, freeze the corresponding embeddings, and tune the rest of the model. We use Scottish Gaelic, Irish, and Quechua for our experiments, with Quechua being a very low-resource language (8.5k training instances). Evaluation on natural language understanding (NLU) tasks – mask-filling, NER, and POS – shows that our proposed approach improves performance when adapting the models to low-resource languages. Additionally, we provide a comprehensive analysis of the effectiveness of training strategies, the choice of pretrained embeddings, and models.

pdf bib abs

Multimodal Large Language Models (MLLMs) have been widely adopted as MLLM-as-aJudges due to their strong alignment with human judgment across various visual tasks. However, most existing judge models are optimized for single-task scenarios and struggle to generalize to diverse contexts, which is a critical requirement for reliable evaluation. To address this limitation, we propose Multi-Task Reinforcement Learning for MLLM-as-a-Judge (MT-RL-Judge), a framework that jointly optimizes the judge model across multiple tasks, leveraging the generalization capabilities of RL. Experimental results against several strong baselines demonstrate that MT-RL-Judge outperforms strong baselines in both judgment consistency and correlation with human preferences. Furthermore, our approach exhibits robust generalization on out-of-distribution tasks, further validating its effectiveness.

pdf bib abs

RAPIDS: Resume Attack Prompt Injection Detection at Scale
Yohann Augey | Joshua H. Levy | Arda Akdemir

The integration of Large Language Models (LLMs) into recruitment workflows has introduced a critical security vulnerability: indirect prompt injection attacks embedded within resumes can manipulate screening tools to override instructions, effectively jailbreaking the hiring process. Frontier LLMs can detect such anomalies, but deploying them at the scale required for high-volume recruitment is prohibitively slow and costly. At the same time, existing generic prompt injection detectors lack the domain specificity needed for nuanced resume attacks. To address this gap, we introduce RAPIDS, a scalable detection framework with three contributions. First, we release a synthetically generated dataset of injection snippets derived from curated attack seeds spanning multiple adversarial strategies to address data scarcity in this domain. Second, we fine-tune a lightweight Small Language Model (SLM) on this data that outperforms the best off-the-shelf detector by over 50% in relative F1 and approaches frontier LLM accuracy. Third, we propose a cascade architecture in which the fine-tuned SLM serves as a high-recall first stage followed by an LLM verifier. This design achieves ≥ 98% end-to-end recall on both evaluated datasets while delivering a 21-24× latency reduction over standalone frontier LLMs (GPT-5-mini), bringing expected per-request latency to 115-171 ms at roughly 3.5% of the API cost.

pdf bib abs

No Innocence in Styling: Discovery of Privacy Protection Capabilities and Security Risks in Consumer Generative AI Writing Assistants
Mohd. Farhan Israk Soumik | Syed Mhamudul Hasan | Wanniarachchi Kankanamge Malithi Mithsara | Ahmed Imteaj | Abdur R. Shahid

Generative AI writing assistants are now integrated into consumer platforms such as Apple Intelligence and Microsoft Copilot, enabling millions of users to automatically rewrite and stylize their text. While positioned as productivity tools, their deployment at scale introduces important and underexplored implications for privacy and platform safety. This paper examines the dual-use nature of platform-level text stylization. Stylization can enhance privacy by suppressing stylistic signals used for profiling and personal data inference. However, the same transformations can be leveraged to evade automated safeguards, including misinformation detection systems. We conduct empirical case studies on emotion inference and misinformation detection across benchmark datasets using deployed stylization modes. We evaluate downstream impact with fine-tuned open-source models and GPT-4o in a zero-shot setting. Our results show that stylization reduces emotion inference accuracy, lowering profiling risk, while increasing error rates in misinformation detection. This discovery reveal a measurable trade-off among privacy protection, moderation robustness, and stylization, highlighting new design and governance challenges for industry deployment.

pdf bib abs

Large Language Models (LLMs) have achieved remarkable performance in Machine Translation (MT), but deploying them at scale remains prohibitively expensive. A widely adopted remedy is the hybrid system paradigm, which balances cost and quality by serving most requests with a small model and selectively routing a fraction to a large model. However, existing routing strategies often rely on heuristics, external predictors, or absolute quality estimation, which fail to capture whether the large model actually provides a worthwhile improvement over the small one. In this paper, we formulate routing as a budget allocation problem and identify marginal gain, i.e., the large model’s improvement over the small model, as the optimal signal for budgeted decisions. Building on this, we propose RouteLMT (routing for LLM-based MT), an efficient in-model router that predicts this expected gain by probing the small translator’s prompt-token representation, without requiring external models or hypothesis decoding. Extensive experiments demonstrate that our RouteLMT outperforms heuristics, quality/difficulty estimation baselines, achieving a superior quality–budget Pareto frontier. Furthermore, we analyze regression risks and show that a simple guarded variant can mitigate severe quality losses.

pdf bib abs

Conversational news recommendation requires grounding each suggestion in a rapidly evolving article corpus while addressing implicit user intents that lack explicit retrievable keywords. To characterize this scenario, we identify 6 intent types from production dialogues: five are implicit and pose fundamental challenges to standard RAG pipelines, forming a critical retrieve-first bottleneck. To address these issues, we introduce intent-driven Semantic ID (SID) generation under a Generate-then-Match paradigm. With two-stage training that consists of multi-task SID alignment and GPT-4 Chain-of-Thought distillation, an LLM maps diverse intents to hierarchical SID prefixes, which are then fuzzy-matched to the current news pool to guarantee fully grounded recommendations. Profile-Aware Dual-Signal Reasoning (PADR) further enables cold-start users to obtain valid recommendations using only profiles. On a mainstream Chinese news platform, our 7B model achieves 0% hallucination and 12.4% L1 match in the 152K open-generation SID space (4 × random baseline). It matches GPT-4+Hybrid RAG on L1 while surpassing it on finer-grained metrics (L2 2 ×, Category +1.2pp) at ∼ 100 × lower cost. Cold-start users, where existing baselines score 0%, achieve 18.0% L1 (6 × random), the highest among all user groups.

pdf bib abs

Qualitative research emphasizes constructing meaning through iterative engagement with textual data. Traditionally, this human-driven process requires navigating coder fatigue and interpretive drift, thus posing challenges when scaling analysis to larger, more complex datasets. Computational approaches to augment qualitative research have been met with skepticism, partly due to their inability to replicate the nuance, context-awareness, and sophistication of human analysis. LLMs, however, present new opportunities to automate aspects of qualitative analysis while upholding rigor and research quality. In this work, we present and benchmark Muse, an interactive qualitative research system that allows researchers to identify themes and annotate datasets, achieving an inter-rater reliability between Muse and humans of Cohen’s 𝜅 = 0.7 for well-specified codes.

pdf bib abs

Entity Exchange in the Wild: A Diagnostic Study of LLM Based Real-World Conversational Entity Extraction
Soumya Jain | Ayush Kumar

Entity extraction from spoken customer–agent conversations is increasingly driving automation in contact centers. In these settings, extraction errors can trigger incorrect system actions, including database updates, verification failures, and unintended workflow execution. While prior work has examined the impact of transcription noise and cross-turn reasoning, it has not systematically analyzed how entity-exchange phenomena themselves shape extraction performance.We model conversational entity exchange along three orthogonal axes: Initiation (how an entity becomes relevant in the dialogue), Evolution (how commitment to an entity’s value develops or changes across turns), and Articulation (how the final committed value is expressed in surface form).We evaluate 16 large language models on 6,387 real-world customer–agent conversations spanning 12 entity types across numeric, alphanumeric, temporal, and free-text categories. Performance varies by as much as 50–60% within the same model depending solely on the underlying entity-exchange phenomena. The most severe failures occur when entity values are revised during the interaction and the model must distinguish intermediate mentions from the final committed value. Even in the absence of revision, digit-by-digit and encoded expressions remain persistent sources of error.Error-Aware prompting improves extraction across all three axes, yielding average gains of up to 6.4% across models. Together, this work provides a structured framework for benchmarking entity extraction in real-world deployments and isolating systematic failure modes grounded in conversational structure.

pdf bib abs

Agentic Context Strategies for Multi-Format Document Understanding: When Should Language Models Use Tools?
Mansi Uniyal | Mukul Singh | Ryan Nadel

Large language models face fundamental trade-offs when processing long documents: full context is expensive and may exceed limits, while RAG risks missing relevant information. We evaluate four context strategies across six frontier models on three document formats (Word, Excel, and PowerPoint). Our key finding: agentic tool-augmented approaches dramatically outperform passive strategies, with RAG+Tools achieving 46% accuracy vs 6% for RAG-only. Tool benefits are consistent across formats (+28-40 points) and models. We further show that (1) intelligent routing matters more than iteration count, (2) tools provide unique capability beyond reasoning loops, and (3) forcing active exploration matches providing context proactively. These results suggest tool augmentation is crucial for complex document QA.

pdf bib abs

Recent advances in multimodal retrieval-augmented generation (MM-RAG) have shifted toward minimal parsing, relying on page-level images for producing retriever embeddings and for answer generation. While efficient, this trend often neglects explicit handling of the rich, structured information in complex enterprise documents, instead depending on pre-trained embeddings or vision-language models to implicitly capture such structure. In this work, we take a more direct approach: MM-BizRAG proactively extracts and represents document structure via a document structure-aware split that dynamically routes documents through orientation-specific ingestion pipelines, applying explicit layout-aware parsing for vertically structured documents (e.g., reports) and holistic page-level representations for horizontally structured documents (e.g., slide decks). A unified LLM-driven artifact transformation pipeline with placeholder-based positional alignment preserves natural reading order, while inference-time multimodal assembly decouples retrieval representations from generation context, enabling richer, more grounded answers without any finetuning requirement. Through experiments on a large, heterogeneous enterprise dataset and two public benchmarks (SlideVQA and FinRAGBench-V), MM-BizRAG consistently outperforms state-of-the-art vision-centric baselines by up to 32% points, with especially strong gains on report-style layouts. Furthermore, we introduce FastRAGEval, a single-call LLM Judge metric for fine-grained generative recall that halves RAGChecker’s cost while achieving stronger human alignment.

pdf bib abs

In the fields of advertising design, artistic creation, and cultural dissemination, there is an increasingly urgent demand for high-quality images that cater to fine-grained aesthetic preferences. Although existing large-scale models can generally meet basic requirements for clarity and alignment with textual elements, they still face significant bottlenecks in achieving precise control and aesthetic optimization. To address this limitation, we propose a set of comprehensive preference indicators across two major dimensions, text-image consistency and aesthetic quality, encompassing multiple criteria ranging from exposure and clarity to visual guidance and innovativeness. Building on these indicators, we have developed a generative framework named AesX to steer the model consistently toward a generation path that more closely aligns with human aesthetic sensibilities. Our experimental findings demonstrate that this approach yields significant improvements in both target recognition accuracy and overall visual aesthetic presentation.

pdf bib abs

Transfer Learning for Generalizable Automated LLM Improvement Pipeline for IVR Navigation
Vishal Sankar Ram | Jason Kushner | Manas Paldhe | Youngseo Son

Administrative tasks in the healthcare domain share linguistic commonalities, but it can be time-consuming to manually design LLM prompts for each use case. When calling health insurers, interactive voice response (IVR) systems cause delays in patient care and increase provider burnout due to complex routing and long hold times. Thus, IVR navigation models can offer significant time savings and reduce barriers to care. We propose a production-quality automated LLM pipeline which leverages a small number of human-labeled ground truth datasets to transfer specialized prompts from one task to another; specifically, we perform a cross-task transfer of our IVR navigation logic, adapting the prompt from reaching the claims department to reaching the patient benefit department. Our approach reduces prompt complexity by up to 80% and obtains 82% turn-level accuracy in real-world industrial healthcare settings, surpassing a human-designed prompt at 79%.

pdf bib abs

Beyond Instruction Optimization: Multi-Agent Error-Driven Class Description Refinement for LLM-Based Classification
Hamvir Dev | Shivam Ratnakant Mhaskar | Sasanka Vutla | Anup Pattnaik

Large Language Models (LLMs) have demonstrated considerable efficacy in classification tasks, yet their performance depends on two critical prompt components: Task Instructions (HOW to classify) and Class Descriptions (WHAT defines each class). While prompt engineering research has extensively explored instruction optimization, class descriptions have received comparatively less attention, often being treated as fixed inputs or simple label names. This represents a critical gap for real-world classification tasks, particularly in contact center domains, where labels often suffer from ambiguous boundaries, overlapping definitions, and incomplete coverage of possible cases—substantially limiting accuracy regardless of instruction quality.We propose a multi-agent framework for iteratively refining class descriptions based on classification errors. By analyzing misclassified instances, language agents automatically generate improved descriptions that better capture class distinctions and resolve ambiguities. Empirical evaluation across contact center and public benchmark datasets demonstrates upto 20.71% accuracy improvements over static class descriptions, addressing an orthogonal dimension to existing instruction optimization techniques.

pdf bib abs

TEN: Table Explicitization, Neurosymbolically
Nikita Mehrotra | Aayush Kumar | Sumit Gulwani | Arjun Radhakrishna | Ashish Tiwari

We present TEN, a neurosymbolic approach for extracting tabular data from semistructured text such as copy-pasted content from PDFs, emails, or OCR-flattened outputs. This task poses real-world challenges in domains like finance and healthcare, where manual copy-paste into spreadsheets introduces errors and OCR distortions compromise data integrity, leading to financial losses and flawed decisions.Purely neural methods suffer from hallucinations and structural inconsistencies, hindering deployment robustness. TEN addresses this via a novel triadic feedback loop that iteratively refines table hypotheses to enforce constraints and achieve verifiable convergence.Experiments show TEN outperforms neural baselines in exact match accuracy and lower hallucination rates. A 21-participant user study rates TEN tables more accurate and preferred in over 60% of pairwise comparisons, though verification and correction effort did not differ significantly between conditions.

pdf bib abs

RADAR: Risk-Aware Distilled Adaptive Routing for Efficient Short-Form Video Platform Ecosystem Governance
Baoyu Jing | Zixuan Wang | Junwen Chen | Xin Dong | Bingfeng Deng

Large-scale integrity enforcement on short-form video platforms typically relies on multiple specialized vertical modules, each dedicated to a specific risk category. However, exhaustively executing these computationally intensive modules over massive content streams leads to substantial inference overhead, despite the fact that most content is benign and violations are usually confined to limited policy domains. To address this inefficiency, we propose RADAR, a lightweight risk-aware routing framework that selectively releases low-risk content while dispatching high-risk instances to appropriate vertical modules. Industrial deployment of such routing systems presents two major challenges: (1) systematic label sparsity caused by disjoint annotation pipelines across risk categories, and (2) the capacity-efficiency tradeoff inherent to compact routing architectures. To overcome these challenges, RADAR incorporates Validity-Aware Masking to handle fragmented supervision and Expert-Guided Knowledge Distillation to transfer knowledge from heavyweight expert models into the lightweight router. Experiments on large-scale real-world datasets demonstrate that the proposed masking strategy effectively mitigates disjoint annotation issues, while distillation substantially enhances routing accuracy, enabling the lightweight router to achieve competitive or superior performance compared to specialized expert models.

pdf bib abs

Knowledge Graphs (KGs) are the backbone of reliable industrial data strategies, yet verbalizing them with Large Language Models (LLMs) often leads to unacceptable risks for high-stakes applications, such as hallucinations or omitted relations. To enforce strict semantic fidelity in KG-to-text generation, we introduce a self-supervised round-trip pipeline. The system verbalizes KG triples into text and immediately attempts to reconstruct the original graph from that text; only verbalizations that enable perfect graph recovery are retained. This creates a closed feedback loop that guarantees the generated text is semantically equivalent to the source data. Experiments confirm that our automated round-trip consistency score correlates strongly with expert judgment, effectively acting as a scalable proxy for human review. Furthermore, we show that standard LLMs can bootstrap their own KG-extraction and generation capabilities by fine-tuning on this trusted synthetic data. Our approach yields significant improvements in triple-extraction accuracy and verbalization faithfulness without relying on costly manual annotation or massive teacher models, offering a practical path to deploying trustworthy, KG-grounded AI systems.

pdf bib abs

We present a deployed system that automates end-to-end customer support workflows inside an enterprise Business Process Management (BPM) platform. The approach is scalable in production and reaches selective automation within two weeks for a new process, leveraging supervision already generated at scale: structured per-case UI interaction traces and low-overhead copilot feedback, where operators either accept a suggestion or provide a correction. A staged deployment pipeline trains a next UI action policy, learns a critic from copilot feedback to calibrate abstention, and executes only high-confidence steps in the background while deferring uncertain decisions to operators and resuming from the updated UI state. This setup lets one operator supervise multiple concurrent sessions and be interrupted only when the system is uncertain. The system operates on a schema-driven view of the BPM interface and includes monitoring and safe fallbacks for production. In production, it automated 45% of sessions and reduced average handling time by 39% without degrading support quality level.

pdf bib abs

Prompt-Level Distillation: A Non-Parametric Alternative to Model Fine-Tuning for Efficient Reasoning
Sanket Badhe | Deep Shah

Advanced reasoning typically requires Chain-of-Thought prompting, which is accurate but incurs prohibitive latency and substantial test-time inference costs. The standard alternative, fine-tuning smaller models, often sacrifices interpretability while introducing significant resource and operational overhead. To address these limitations, we introduce Prompt-Level Distillation (PLD). We extract explicit reasoning patterns from a Teacher model and organize them into a structured list of expressive instructions for the Student model’s System Prompt. Evaluated using Gemma-3 4B, PLD improved Macro F1 scores on StereoSet (57% to 90.0%) and Contract-NLI (67% to 83%), while increasing LogiQA accuracy to 70%. Similar results on Mistral Small 3.1 demonstrate cross-architecture generalizability, enabling these compact models to match frontier performance with negligible latency overhead. These expressive instructions render the decision-making process transparent, allowing for full human verification of logic, making this approach ideal for regulated industries such as law, finance, and content moderation, as well as high-volume use cases and edge devices.

pdf bib abs

MTIVE: Multi-Task Image Verification Engine Using Vision-Language Models for E-commerce
Yu-Tong Cao | Vishnu Prabhakaran | Arunita Das | Purav Aggarwal | Anoop Saladi

Vision-language models show promise for e-commerce automation but struggle with noisy real-world images and multi-task requirements. We introduce MTIVE, a curriculum learning framework that progressively adapts base models through three stages: continued pre-training on large-scale e-commerce datasets with contrastive learning and diverse dialogue templates, instruction tuning on synthetic data, and modular task-specific expert training. Our architecture uses frozen base weights with stacked LoRA adapters—shared modules for domain knowledge and lightweight task-specific experts—enabling continual learning without catastrophic forgetting. MTIVE outperforms open-source and proprietary baselines in both standard and continual learning settings.

pdf bib abs

Measuring and Mitigating Racial Bias in Embedding Models: A Comparative Study for Law Enforcement Retrieval
Archan Dutta

Embedding models are often used for semantic retrieval in high-stakes domains such as law enforcement, where biased outputs can have severe consequences. We systematically measure racial bias in six widely used embedding models by computing similarity scores between crime incident texts that include racial identity tokens and simple law enforcement queries. The analysis reveals that racial descriptors consistently affect cosine similarity scores and retrieval rankings for semantically identical crime incidents. All models exhibit statistically significant bias, with magnitude varying across models. This study provides a comprehensive methodology and metrics to aid the selection of embedding models when deploying NLP-based systems in the law enforcement domain. Organizations can reduce bias at low cost through informed model selection. The methodology establishes reproducible metrics for measuring bias in embedding-based systems.

pdf bib abs

VishBox v2: A Multi-Agent System for Adaptive Voice Phishing Simulation
Sungmi Park | Daon Choi | Yoonmo Yang | Hong Yunyi | Heedou Kim

Voice phishing is a multi-round social engineering attack in which strategy and victim psychology co-evolve, yet real transcripts are rarely accessible for systematic analysis. We present VishBox v2, a multi-agent architecture that generates structured phishing simulations grounded in crime-script procedures and persuasion principles. A Main Agent orchestrates a Dialogue Agent and a Tactic Search Agent, combining multi-round dialogue generation, web-based tactic mining, and emotion-driven vulnerability tracking. Across 571 rounds, results including police-expert evaluation support procedural realism and show that VishBox v2 captures tactic concentration, vulnerability transitions, and web-search-induced procedural disruptions. The framework provides a controlled foundation for safer red-teaming and security training research.

pdf bib abs

ResoDiff-44k: High-Fidelity Cross-Lingual Speech and Singing Synthesis via Discrete Diffusion
Gyanendra Das | Sai Satyam Jena

While large-scale generative speech models have achieved remarkable semantic coherence, industrial deployment remains constrained by a fidelity ceiling typically capped at lower sampling rates. A fundamental limitation is the reliance on intermediate mel-spectrograms, a low-dimensional bottleneck that discards phase and high-frequency information, causing artifacts in expressive scenarios like singing. In this work, we introduce ResoDiff-44k, a production-grade generative foundation model designed for cinema-quality, 44.1kHz audio synthesis. Departing from standard masked audio modeling and mel-spectrogram inversion, ResoDiff-44k leverages Discrete Diffusion over a pure Descript Audio Codec latent space. We pre-train ResoDiff-44k on a massive 150K -hour multilingual dataset to establish a robust acoustic prior, followed by targeted fine-tuning on a curated regional mixed-language and singing corpus. Our experiments demonstrate that replacing the standard prediction head with a discrete diffusion trajectory significantly reduces misalignment in long sequences. We report a double-blind subjective evaluation showing that ResoDiff-44k achieves a 4.6 Mean Opinion Score in 44.1kHz singing synthesis and a 71% reduction in character error rate on regional mixed-language prompts compared to strong baselines. The proposed pipeline offers a viable path for deploying high-fidelity, culturally adaptive conversational agents.

pdf bib abs

Real-time detection and mitigation of technical anomalies are critical for large-scale cloud-native services, where even minutes of downtime can result in massive financial losses and diminished user trust. While customer incidents serve as a vital signal for discovering risks missed by monitoring, extracting actionable intelligence from this data remains challenging due to extreme noise, high throughput, and semantic complexity of diverse business lines. In this paper, we present TingIS, an end-to-end system designed for enterprise-grade incident discovery. At the core of TingIS is a multi-stage event unification engine that synergizes efficient indexing techniques with Large Language Models (LLMs) to make informed decisions on event merging, enabling the stable extraction of actionable incidents from just a handful of diverse user descriptions. This engine is complemented by a cascaded routing mechanism for precise business attribution and a multi-dimensional noise reduction pipeline that integrates domain knowledge, statistical patterns, and behavioral filtering. Deployed in a production environment handling a peak throughput of over 2,000 messages per minute and 300,000 messages per day, TingIS achieves a P90 alert latency of 3.5 minutes and a 95% discovery rate for high-priority incidents. Benchmarks constructed from real-world data demonstrate that TingIS significantly outperforms baseline methods in routing accuracy, clustering quality, and Signal-to-Noise Ratio.

pdf bib abs

While Large Language Models (LLMs) are increasingly integrated into in-vehicle conversational systems, identifying the optimal model remains challenging due to the lack of domain-specific evaluation standards tailored to real-world deployment requirements. In this paper, we propose a novel evaluation framework for in-vehicle assistants, with a particular focus on Korean-language localization. Our empirical analysis reveals notable patterns in model behavior. First, fine-grained Korean honorific control remains unstable in current LLMs, indicating that precise speech-level realization must be explicitly evaluated in localization settings. Second, models exhibit weaker performance in strategic conversational metrics like clarification and proactivity. Our analysis suggests this stems from the inherent subjective complexity of these tasks, where our framework adopts a conservative evaluation stance to prioritize reliability. Together, our findings underscore that automotive AI must move beyond general competence toward precise linguistic tailoring and reliable, safety-oriented interaction management.

pdf bib abs

Index-Time Prefix Injection for Multi-Tenant Retrieval: Improving Search Relevance Without Model Fine-Tuning
Vaibhav Varshney | Manjunatha Naik MC

Multi-tenant enterprise search platforms serve hundreds of customers through a single shared retrieval model. Fine-tuning on individual customer data is typically prohibited by contractual and regulatory constraints, and maintaining per-customer models does not scale. We present index-time prefix injection, a training-free method that improves retrieval relevance by prepending domain-descriptive natural-language prefixes to documents during indexing. For example, prepending "IT service management knowledge article:" to an IT knowledge base shifts its embeddings into a tighter, more domain-coherent region of the vector space. Prefixes are discovered through a tiered strategy: LLM-based generation from document samples when data policies allow, domain-expert curation when they do not, and a standardized prefix library as fallback. Deployed across 18 languages and 400+ customer instances, the approach yields 3–8% Hit@5 improvements with zero model training. A/B tests confirm a 4.2% CTR lift. We describe the system design, evaluation at scale, and deployment lessons including failure modes.

pdf bib abs

Benchmarking Testing in Automated Theorem Proving
Jongyoon Kim | Hojae Han | Seung-Won Hwang

Recent advances in large language models (LLMs) have shown promise in formal theorem proving, yet evaluating semantic correctness remains challenging. Existing evaluations rely on indirect proxies such as lexical overlap with human-annotated proof,or expensive manual inspection.Inspired by the shift from lexical comparison to test-based evaluation in code generation, we propose T², a framework that evaluates the semantic correctness of formal theorems: a generated theorem is considered correct only if all dependent successor theorems compile successfully, analogous to integration testing.We construct a benchmark from 5 real-world Lean 4 repositories, comprising 2,206 problems paired with 41 successor theorems on average, automatically extracted without human effort.Experiments demonstrate that while state-of-the-art models achieve high compilation success, they perform significantly worse under our semantic metric.The best model, Claude-Sonnet-4.5, achieves only 38.9% Testing Accuracy on the full set, given both natural language proof and successor theorems as context, revealing a critical gap in current theorem generation capabilities.

pdf bib abs

Graph-Based Phonetic Error Correction of Noisy ASR
Pratik Rakesh Singh | Mohammadi Zaki | Aneesh Mukkamla | Pankaj Wasnik

Automatic speech recognition (ASR) systems, despite low overall word error rates, produce residual lexical errors that disproportionately affect semantically critical tokens such as named entities, negations, and sentiment-bearing words. These errors are often structured, arising from phonetic similarity rather than random noise, making naive token-level correction insufficient.We propose a structured ASR correction framework, that we call G-SPIN, that combines phonetic graph modeling with contextual language understanding. A graph neural network (GNN) first constructs acoustically plausible candidate neighborhoods for flagged tokens, explicitly restricting the correction search space to phonetic alternatives. A masked language model (MLM) then provides local contextual scoring, and an instruction-tuned large language model (LLM) performs final context-aware re-ranking over this compact candidate set. By decoupling structured phonetic reasoning from contextual semantic selection, our method avoids unconstrained generation while improving correction accuracy. The framework is lightweight, modular, and operates entirely at inference time.

pdf bib abs

CAT: Confidence-Adaptive Thinking for Efficient Reasoning of Large Reasoning Models
Qizhi Jiang | Shuo Wang | Pei Ke | Yuhang Song | Ke Qin

Large Reasoning Models (LRMs) have achieved remarkable success on complex tasks by leveraging long chain-of-thought (CoT) trajectories, yet they frequently exhibit overthinking on simple queries, resulting in significant token overhead and reduced inference efficiency. However, existing compression methods predominantly apply uniform length reduction or rely on coarse-grained difficulty estimation, often leading to performance degradation on difficult problems. To address this limitation, we propose Confidence-Adaptive Thinking (CAT), a framework that incorporates the model’s intrinsic self-certainty signals as confidence into the preference optimization process, which autonomously modulates reasoning lengths based on problem difficulty. Experimental results show that CAT consistently outperforms state-of-the-art baselines on reasoning accuracy across multiple benchmarks on different base models. Our work enables LRMs to effectively compress confident responses while deliberating on uncertain ones, offering a potentially robust solution for balancing accuracy and latency in practical industrial scenarios.

pdf bib abs

LatentGate: Low-Latency Semantic Routing via Frozen-Backbone Probing of Small Language Models
Shivam Ratnakar | Abhiroop Talasila | Vinayak K Doifode

As Multi-Agent Systems scale to hundreds of specialized agents, routing becomes a critical bottleneck. Prompt-based LLM routers deliver strong semantic reasoning but incur prohibitive latency (~1500–2000ms) and cost that scales with agent count, while embedding-based routers are fast (25–50ms) but collapse semantically similar yet functionally distinct agents. We identify *representation anisotropy*, the geometric collapse of hidden-state vectors into a narrow cone, as a key mechanism underlying embedding-based routing failure. We propose **LatentGate**, a non-generative router that extracts mean-pooled hidden states from a frozen small language model (SLM), applies PCA-whitening to resolve the anisotropy, and trains a lightweight linear probe for agent classification. Across 5 SLM backbones and 100 enterprise agents, LatentGate achieves 98.8% in-domain and 80.0% OOD accuracy on natural queries, 13–22 absolute points above embedding baselines, and 92.9% on CLINC150. It takes ~28ms to run on a T4 GPU, with the SLM forward pass independent of agent count and classification adding a negligible O(Ck) term. We demonstrate the potential of using a lightweight linear probe to enable sub-10ms warm-start retraining from user feedback, providing a foundation for continual learning in production environments. Benchmarking prompt-based routing with GPT-4.1, GPT-4.1-nano, and Gemini 2.5 Flash confirms degradation to 70–77% accuracy at 100 agents with 1500–2000ms latency, motivating non-generative alternatives.

pdf bib abs

Reasoning LLMs often spend substantial tokens on long intermediate reasoning traces (e.g., chain-of-thought) when solving new problems. We propose to summarize and store reusable reasoning skills distilled from extensive deliberation and trial-and-error exploration, and to retrieve these skills at inference time to guide future reasoning. Unlike the prevailing reasoning from scratch paradigm, our approach first recalls relevant skills for each query, helping the model avoid redundant detours and focus on effective solution paths. We evaluate our method on coding and mathematical reasoning tasks, and find that it significantly reduces reasoning tokens while improving overall performance. The resulting lower per-request cost indicates strong practical and economic potential for real-world deployment.