Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 5: Industry Track)

Yevgen Matusevych, Gülşen Eryiğit, Nikolaos Aletras (Editors)

Anthology ID:: 2026.eacl-industry
Month:: March
Year:: 2026
Address:: Rabat, Morocco
Venue:: EACL
SIG:
Publisher:: Association for Computational Linguistics
URL:: https://preview.aclanthology.org/ingest-eacl/2026.eacl-industry/
DOI:
ISBN:: 979-8-89176-384-5
Bib Export formats:: BibTeX
PDF:: https://preview.aclanthology.org/ingest-eacl/2026.eacl-industry.pdf

Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 5: Industry Track)
Yevgen Matusevych | Gülşen Eryiğit | Nikolaos Aletras

pdf bib abs

Iterative Structured Pruning for Large Language Models with Multi-Domain Calibration
Guangxin Wu | Hao Zhang | Zhang Zhibin | Jiafeng Guo | Xueqi Cheng

Large Language Models (LLMs) have achieved remarkable success across a wide spectrum of natural language processing tasks. However, their ever-growing scale introduces significant barriers to real-world deployment, including substantial computational overhead, memory footprint, and inference latency. While model pruning presents a viable solution to these challenges, existing unstructured pruning techniques often yield irregular sparsity patterns that necessitate specialized hardware or software support. In this work, we explore structured pruning, which eliminates entire architectural components and maintains compatibility with standard hardware accelerators. We introduce a novel structured pruning framework that leverages a hybrid multi-domain calibration set and an iterative calibration strategy to effectively identify and remove redundant channels. Extensive experiments on various models across diverse downstream tasks show that our approach achieves significant compression with minimal performance degradation.

pdf bib abs

SCRIPTMIND: Crime Script Inference and Cognitive Evaluation for LLM-based Social Engineering Scam Detection System
Heedou Kim | Changsik Kim | Sanghwa Shin | Jaewoo Kang

Social engineering scams increasingly employ personalized, multi-turn deception, exposing the limits of traditional detection methods. While Large Language Models (LLMs) show promise in identifying deception, their cognitive assistance potential remains underexplored. We propose ScriptMind, an integrated framework for LLM-based scam detection that bridges automated reasoning and human cognition. It comprises three components: the Crime Script Inference Task (CSIT) for scam reasoning, the Crime Script–Aware Inference Dataset (CSID) for fine-tuning small LLMs, and the Cognitive Simulation-based Evaluation of Social Engineering Defense (CSED) for assessing real-time cognitive impact. Using 571 Korean scam cases, we built 22,712 structured scammer-sequence training instances. Experimental results show that the 11B small LLM fine-tuned with ScriptMind outperformed GPT-4o by 13%, achieving superior performance over commercial models in detection accuracy, false-positive reduction, scammer utterance prediction, and rationale quality. Moreover, in phone scam simulation experiments, it significantly enhanced and sustained users’ suspicion levels, improving their cognitive awareness of scams. ScriptMind represents a step toward human-centered, cognitively adaptive LLMs for scam defense.

pdf bib abs

From Paper to Structured JSON: An Agentic AI Workflow for Compliant BMR Digital Transformation
Bhavik Agarwal | Nidhi Bendre | Viktoria Rojkova

Agentic AI workflow converts noisy pharmaceutical batch records into validated JSON using hybrid OCR, vision–language and schema-guided LLMs, cutting QA review from hours to minutes while preserving GMP-critical structure.

pdf bib abs

Compact Multimodal Language Models as Robust OCR Alternatives for Noisy Textual Clinical Reports
Nikita Neveditsin | Pawan Lingras | Salil Patil | Swarup Patil | Vijay Kumar Mago

Digitization of medical records often relies on smartphone photographs of printed reports, producing images degraded by blur, shadows, and other noise. Conventional OCR systems, optimized for clean scans, perform poorly under such real-world conditions. This study evaluates compact multimodal language models as privacy-preserving alternatives for transcribing noisy clinical documents. Using obstetric ultrasound reports written in regionally inflected medical English common to Indian healthcare settings, we compare eight systems in terms of transcription accuracy, noise sensitivity, numeric accuracy, and computational efficiency. Compact multimodal models consistently outperform both classical and neural OCR pipelines. Despite higher computational costs, their robustness and linguistic adaptability position them as viable candidates for on-premises healthcare digitization.

pdf bib abs

Digital footprints—records of individuals’ interactions with digital systems—are essential for studying behavior, developing personalized applications, and training machine learning models. However, research in this area is often hindered by the scarcity of diverse and accessible data. To address this limitation, we propose a novel method for synthesizing realistic digital footprints using large language model (LLM) agents. Starting from a structured user profile, our approach generates diverse and plausible sequences of user events, ultimately producing corresponding digital artifacts such as emails, messages, calendar entries, reminders, etc. Intrinsic evaluation results demonstrate that the generated dataset is more diverse and realistic than existing baselines. Moreover, models fine-tuned on our synthetic data outperform those trained on other synthetic datasets when evaluated on real-world out-of-distribution tasks.

pdf bib abs

We introduce EPAG, a benchmark dataset and framework designed for evaluating the pre-consultation ability of LLMs using diagnostic guidelines. LLMs are evaluated directly through HPI-diagnostic guideline comparison and indirectly through disease diagnosis. In our experiments, we observe that small open-source models fine-tuned with a well-curated, task-specific dataset can outperform frontier LLMs in pre-consultation. Additionally, we find that increased amount of HPI (History of Present Illness) does not necessarily lead to improved diagnostic performance. Further experiments reveal that the language of pre-consultation influences the characteristics of the dialogue. By open-sourcing our dataset and evaluation pipeline on https://github.com/seemdog/EPAG, we aim to contribute to the evaluation and further development of LLM applications in real-world clinical settings.

pdf bib abs

SELENE: Selective and Evidence-Weighted LLM Debating for Efficient and Reliable Reasoning
Akshay Verma | Swapnil Gupta | Deepak Gupta | Prateek Sircar | Siddharth Pillai

Multi-Agent Debate (MAD) frameworks improve factual reliability in large language models (LLMs) by allowing agents to critiqueand refine one another’s reasoning. Yet, existing MAD systems are computationally expensive and prone to degradation under pro-longed debates due to redundant exchanges and unstable judging. We propose a lightweight,industry-deployable alternative that unifies Selective Debate Initiation (SDI) with Evidence Weighted Self-Consistency (EWSC) for adaptive, debate-on-demand reasoning. SDI dynamically predicts when debate is necessary by detecting confidence-likelihood misalignment and semantic disagreement, skippingwell-aligned queries to conserve computation. EWSC replaces a single-judge verdict with a variance-aware, evidence-weighted aggregation across paraphrased evaluations, yielding more stable factual judgments. Combined, SDI and EWSC reduce token consumption by nearly 50% while improving both accuracy and calibration. Evaluated on BoolQ, CosmosQA, and an internal QnA benchmark, our framework achieves higher factual robustness and efficiency, demonstrating that scalable, epistemically reliable multi-agent reasoning is practical for real-world LLM deployments.

pdf bib abs

We introduce SymPyBench, a large-scale synthetic benchmark of 15K university-level physics problems (90/10% train/test split). Each problem is fully parameterized, supporting an effectively infinite range of input configurations, and is accompanied by structured, step-by-step reasoning and executable Python code that produces the ground-truth solution for any parameter set. The benchmark contains three question types: MC-Symbolic (multiple-choice with symbolic options), MC-Numerical (multiple-choice with numerical options), and free-form (open-ended responses). These diverse formats test complementary reasoning skills. In addition to standard accuracy, we introduce three new metrics: Consistency Score, Failure Rate, and Confusion Rate, that quantify variability and uncertainty across problem variants. Experiments with state-of-the-art instruction-tuned language models reveal both strengths and limitations in scientific reasoning, positioning SymPyBench as a foundation for developing more robust and interpretable reasoning systems.

pdf bib abs

KV Pareto: Systems-Level Optimization of KV Cache and Model Compression for Long Context Inference
Sai Gokhale | Devleena Das | Rajeev Patwari | Ashish Sirasao | Elliott Delaye

Long-context Large Language Models (LLMs) face significant memory bottlenecks during inference due to the linear growth of key-value (KV) cache with sequence length. While individual optimization techniques like KV cache quantization, chunked prefill, and model weight quantization have shown promise, their joint effects and optimal configurations for edge deployment remain underexplored. We introduce KV Pareto, a systems-level framework that systematically maps the trade-off frontier between total memory consumption and task accuracy across these three complementary optimization techniques. Our framework evaluates multiple LLM architectures (Qwen, Llama, Mistral) with varying KV quantization schemes (int2/4/8, mixed-precision), granularities (per-token, per-tensor, per-block), and 4-bit weight quantization via AWQ. Our framework identifies model-specific Pareto-optimal configurations that achieve 68-78% total memory reduction with minimal (1-3%) accuracy degradation on long-context tasks. We additionally verify the selected frontiers on additional benchmarks of Needle-in-a-Haystack, GSM8k and MMLU as well as extended context lengths of up to 128k to demonstrate the practical need of joint optimization for efficient LLM inference.

pdf bib abs

MizanQA: A Benchmark for Multi-Answer Moroccan Legal QA
Adil Bahaj | Mounir Ghogho

We present MizanQA, a benchmark for assessing LLMs on Moroccan legal MCQs, many with multiple correct answers. Covering 1,776 expert-verified questions in Modern Standard Arabic enriched with Moroccan idioms, the dataset reflects influences from Maliki jurisprudence, customary law, and French legal traditions. Unlike single-answer settings, MizanQA features variable option counts, creating added difficulty. We evaluate multilingual and Arabic-centric models in zero-shot, native-Arabic prompts, measuring accuracy, a precision-penalized F1-like score, and calibration errors. Results show large performance gaps and miscalibration, particularly under stricter penalties. By scoping this benchmark to parametric knowledge only, we provide a baseline for future retrieval-augmented and rationale-focused setups.

pdf bib abs

Real-time multimodal auto-completion is essential for digital assistants, chatbots, design tools, and healthcare consultations, where user inputs rely on shared visual context. We introduce Multimodal Auto-Completion (MAC), a task that predicts upcoming characters in live chats using partially typed text and visual cues. Unlike traditional text-only auto-completion (TAC), MAC grounds predictions in multimodal context to better capture user intent. To enable this task, we adapt MMDialog and ImageChat to create benchmark datasets. We evaluate leading vision-language models (VLMs) against strong textual baselines, highlighting trade-offs in accuracy and efficiency. We present Router-Suggest, a router framework that dynamically selects between textual models and VLMs based on dialog context, along with a lightweight variant for resource-constrained environments. Router-Suggest achieves a 2.3x to 10x speedup over the best-performing VLM. A user study shows that VLMs significantly excel over textual models on user satisfaction, notably saving user typing effort and improving the quality of completions in multi-turn conversations. These findings underscore the need for multimodal context in auto-completions, leading to smarter, user-aware assistants.

pdf bib abs

Beyond Unified Models: A Service-Oriented Approach to Low Latency, Context Aware Phonemization for Real Time TTS
Mahta Fetrat Qharabagh | Donya Navabi | Zahra Dehghanian | Morteza Abolghasemi | Hamid R. Rabiee

Lightweight, real-time text-to-speech systems are crucial for accessibility. However, the most efficient TTS models often rely on lightweight phonemizers that struggle with context-dependent challenges. In contrast, more advanced phonemizers with a deeper linguistic understanding typically incur high computational costs, which prevents real-time performance. This paper examines the trade-off between phonemization quality and inference speed in G2P-aided TTS systems, introducing a practical framework to bridge this gap. We propose lightweight strategies for context-aware phonemization and a service-oriented TTS architecture that executes these modules as independent services. This design decouples heavy context-aware components from the core TTS engine, effectively breaking the latency barrier and enabling real-time use of high-quality phonemization models. Experimental results confirm that the proposed system improves pronunciation soundness and linguistic accuracy while maintaining real-time responsiveness, making it well-suited for offline and end-device TTS applications.

pdf bib abs

Retrieval-Augmented Generation (RAG) systems depend critically on retrieval quality to enable accurate, contextually relevant LLM responses. While LLMs excel at synthesis, their RAG performance is bottlenecked by document relevance. We evaluate advanced retrieval techniques including embedding model comparison, Reciprocal Rank Fusion (RRF), embedding concatenation and list-wise and adaptive LLM-based re-ranking, demonstrating that zero-shot LLMs outperform traditional cross-encoders in identifying high-relevance passages. We also explore context-aware embeddings, diverse chunking strategies, and model fine-tuning. All methods are rigorously evaluated on a proprietary dataset powering our deployed production chatbot, with validation on three public benchmarks: FiQA, HotpotQA, and SciDocs. Results show consistent gains in Recall@10, closing the gap with Recall@50 and yielding actionable pipeline recommendations. By prioritizing retrieval enhancements, we significantly elevate downstream LLM response quality in real-world, customer-facing applications.

pdf bib abs

Scaling Intent Understanding: A Framework for Classification with Clarification using Lightweight LLMs
Subhadip Nandi | Tanishka Agarwal | Anshika Singh | Priyanka Bhatt

Despite extensive research in intent classification, most task-oriented dialogue systems still rigidly assign intents to user utterances without addressing ambiguity, often leading to misrouted requests, irrelevant responses, and user frustration. Proprietary large-language models (LLMs) can generate effective clarifying questions but are too costly for large-scale deployment. Smaller open-source LLMs are more economical, but struggle to ask appropriate clarifying questions. This paper introduces a domain-agnostic framework that equips lightweight, production-ready open-source LLMs with the ability to perform intent classification alongside precise ambiguity resolution via clarifying questions. We validate our framework on both proprietary and public intent classification datasets, demonstrating its ability to perform intent classification as well as generate clarification questions in case of ambiguity. To compare models, those trained with our framework and external baselines, we also propose an evaluation methodology that jointly assesses the accuracy of intent classification and the timing and quality of clarifying questions. Our instruction-tuned models achieve performance comparable to leading proprietary LLMs while offering an 8X reduction in inference cost, enabling broader, cost-efficient deployment. When deployed in the customer-care system of an e-commerce enterprise, our model reduced the misrouting rate by 8%, resulting in a significant improvement in automation rates, which potentially translates in dollar savings by reducing escalations to human agents.

pdf bib abs

Beyond IVR: Benchmarking Customer Support LLM Agents for Business-Adherence
Sumanth Balaji | Piyush Mishra | Aashraya Sachdeva | Suraj Agrawal

Traditional customer support systems, such as Interactive Voice Response (IVR), rely on rigid scripts and lack the flexibility required for handling complex, policy-driven tasks. While large language model (LLM) agents offer a promising alternative, evaluating their ability to act in accordance with business rules and real-world support workflows remains an open challenge. Existing benchmarks primarily focus on tool usage or task completion, overlooking an agent’s capacity to adhere to multi-step policies, navigate task dependencies, and remain robust to unpredictable user or environment behavior. In this work, we introduce JourneyBench, a benchmark designed to assess policy-aware agents in customer support. JourneyBench leverages graph representations to generate diverse, realistic support scenarios and proposes the User Journey Coverage Score, a novel metric to measure policy adherence. We evaluate multiple state-of-the-art LLMs using two agent designs: a Static-Prompt Agent (SPA) and a Dynamic-Prompt Agent (DPA) that explicitly models policy control. Across 703 conversations in three domains, we show that DPA significantly boosts policy adherence, even allowing smaller models like GPT-4o-mini to outperform more capable ones like GPT-4o. Our findings demonstrate the importance of structured orchestration and establish JourneyBench as a critical resource to advance AI-driven customer support beyond IVR-era limitations.

pdf bib abs

Agentic search has emerged as a promising paradigm for adaptive retrieval systems powered by large language models (LLMs). However, existing benchmarks primarily focus on quality, overlooking efficiency factors that are critical for real-world deployment. Moreover, real-world user queries often contain underspecified preferences, a challenge that remains largely underexplored in current agentic search evaluation. As a result, many agentic search systems remain impractical despite their impressive performance.In this work, we introduce HotelQuEST, a benchmark comprising 214 hotel search queries that range from simple factual requests to complex queries, enabling evaluation across the full spectrum of query difficulty. We further address the challenge of evaluating underspecified user preferences by collecting clarifications that make annotators’ implicit preferences explicit for evaluation.We find that LLM-based agents achieve higher accuracy than traditional retrievers, but at substantially higher costs due to redundant tool calls and suboptimal routing that fails to match query complexity to model capability. Our analysis exposes inefficiencies in current agentic search systems and demonstrates substantial potential for cost-aware optimization.

pdf bib abs

TASER: Table Agents for Schema-guided Extraction and Recommendation
Nicole Cho | Kirsty Fielding | William Watson | Sumitra Ganesh | Manuela Veloso

Real-world financial filings report critical information about an entity’s investment holdings, essential for assessing that entity’s risk, profitability, and relationship profile. Yet, these details are often buried in messy, multi-page, fragmented tables that are difficult to parse, hindering downstream QA and data normalization. Specifically, 99.4% of the tables in our financial table dataset lack bounding boxes, with the largest table spanning 44 pages. To address this, we present TASER (Table Agents for Schema-guided Extraction and Recommendation), a continuously learning, agentic table extraction system that converts highly unstructured, multi-page, heterogeneous tables into normalized, schema-conforming outputs. Guided by an initial portfolio schema, TASER executes table detection, classification, extraction, and recommendations in a single pipeline. Our Recommender Agent reviews unmatched outputs and proposes schema revisions, enabling TASER to outperform vision-based table detection models such as Table Transformer by 10.1%. Within this continuous learning process, larger batch sizes yield a 104.3% increase in useful schema recommendations and a 9.8% increase in total extractions. To train TASER, we manually labeled 22,584 pages and 3,213 tables covering 731.7 billion in holdings, culminating in TASERTab to facilitate research on real-world financial tables and structured outputs. Our results highlight the promise of continuously learning agents for robust extractions from complex tabular data.

pdf bib abs

TAGQuant: Token-Aware Clustering for Group-Wise Quantization
Jaeseong Lee | Seung-won Hwang | Aurick Qiao | Zhewei Yao | Yuxiong He

Grouping, e.g., grouping channels, which is widely used in current integer-based quantization, has become essential for the emerging MXFP4 format. Ideally, each group should contain channels with similar quantization scales. To guide such groups, existing work clusters the channels using scalar proxy, ignoring the token dimension, which we find suboptimal. In this paper, we propose TAGQuant, a simple yet powerful enhancement for such “group-wise” quantization. By strategically shuffling channels to group those with similar token-wise activation distributions, TAGQuant ensures better clustering of large- and small-range values. This shuffle operation is hardware-efficient, and seamlessly integrated into the quantization process with only 0.01x latency overhead. TAGQuant reduces relative GSM8K error in both INT4 and MXFP4 formats, by up to 86% in Llama-3.1-8B-Instruct compared to baselines, validating the effectiveness of our channel shuffling approach for group-wise quantization. Code is publicly available.

pdf bib abs

Beyond Grid Search: Leveraging Bayesian Optimization for Accelerating RAG Pipeline Optimization
Anum Afzal | Xueru Zheng | Florian Matthes

Finding optimal configurations for Retrieval-Augmented Generation (RAG) pipelines via grid search is computationally prohibitive, limiting real-world scalability. We investigate Bayesian Optimization (BO) as an efficient alternative, systematically comparing seven BO strategies combining four surrogate models and two multi-fidelity methods across FiQA, SciFact, and HotpotQA datasets. Our framework explores both global pipeline and local component-wise optimization, targeting final RAG performance and resource efficiency. Our results show that BO reduces optimization time by up to 84% compared to grid search while maintaining comparable accuracy, with local optimization offering the most practical balance for deployment. Notably, performance gains plateau with larger evaluation budgets, suggesting that moderate resource investments suffice for effective RAG tuning. We provide actionable guidelines that empower industry practitioners to efficiently configure and deploy high-performing RAG systems under real-world constraints.

pdf bib abs

BornoDrishti: Leveraging Vision Encoders and Domain-Adaptive Learning for Bangla OCR on Diverse Documents
S M Jishanul Islam | Md Mehedi Hasan | Masbul Haider Ovi | Akm Shahariar Azad Rabby | Fuad Rahman

OCR for Bangla scripts remains a challenging problem, with existing solutions limited to single-domain processing. Current approaches lack a unified vision encoder that can understand diverse Bangla script variations, hindering practical deployment. We present BornoDrishti, the first unified OCR system based on the vision transformer that accurately recognizes both printed and handwritten Bangla scripts within a single model. Our approach introduces a novel domain objective that enables the model to learn domain-invariant representations while preserving script-specific features, eliminating the need for separate domain experts. BornoDrishti achieves competitive accuracy across both domains, setting state-of-the-art performance for printed scripts and demonstrating that a single unified model can match or exceed specialized uni-domain systems. We evaluate our model against state-of-the-art domain-specific and cross-domain OCR systems. This work establishes a foundation for advancing practical applications by using a unified multi-domain OCR system for complex Bangla scripts.

pdf bib abs

MobileCity: An Efficient Framework for Large-Scale Urban Behavior Simulation
Xiaotong Ye | Nicolas Bougie | Toshihiko Yamasaki | Narimawa Watanabe

Generative agents offer promising capabilities for simulating realistic urban behaviors. However, existing methods often rely on static profiles, oversimplified behavioral logic, and synchronous inference pipelines that hinder scalability. We present MobileCity, a lightweight generative-agent framework for city-scale simulation powered by cognitively-grounded generative agents. Each agent acts based on its needs, habits, and obligations, evolving over time. Agents are initialized from survey-based demographic data and navigate a realistic multimodal transportation network spanning multiple types of vehicles. To achieve scalability, we introduce asynchronous batched LLM inference during action selection and a low-token communication mechanism. Experiments with 4,000 agents demonstrate that MobileCity generates more human-like urban dynamics than baselines while maintaining high computational efficiency. Our code is publicly available at https://github.com/Tony-Yip/MobileCity.

pdf bib abs

When applying LLMs to real-world enterprise operations, LLMs need to handle proprietary knowledge in small domains of specific operations (micro domains).A previous study shows micro domain-adaptive pre-training (mDAPT) with fewer documents is effective, similarly to DAPT in larger domains.However, it evaluates mDAPT only on multiple-choice questions; thus, its effectiveness for generative tasks in real-world operations remains unknown.We aim to reveal the potential and bottlenecks of mDAPT for generative tasks.To this end, we disentangle the answering process into three subtasks and evaluate the performance of each subtask: (1) eliciting facts relevant to questions from an LLM’s own knowledge, (2) reasoning over the facts to obtain conclusions, and (3) composing long-form answers based on the conclusions.We verified mDAPT on proprietary IT product knowledge for real-world questions in IT technical support operations. As a result, mDAPT resolved the elicitation task that the base model struggled with but did not resolve other subtasks.This clarifies mDAPT’s effectiveness in the knowledge aspect and its bottlenecks in other aspects.Further analysis empirically shows that resolving the elicitation and reasoning tasks ensures sufficient performance (over 90%), emphasizing the need to enhance reasoning capability.

pdf bib abs

A Compliance-Preserving Retrieval System for Aircraft MRO Task Search
Byungho Jo

Aircraft Maintenance Technicians (AMTs) spend up to 30% of work time searching manuals—a documented efficiency bottleneck in MRO operations where every procedure must be traceable to certified sources.We present a compliance-preserving retrieval system that adapts LLM reranking and semantic search to aviation MRO environments by operating alongside, rather than replacing, certified legacy viewers.The system constructs revision-robust embeddings from ATA chapter hierarchies and uses vision-language parsing to structure certified content, allowing technicians to preview ranked tasks and access verified procedures in existing viewers.Evaluation on 49k synthetic queries achieves >90% retrieval accuracy, while bilingual controlled studies with 10 licensed AMTs demonstrate 90.9% top-10 success rate and 95% reduction in lookup time—from 6-15 minutes to 18 seconds per task.These gains provide concrete evidence that semantic retrieval can operate within strict regulatory constraints and meaningfully reduce operational workload in real-world multilingual MRO workflows.

pdf bib abs

No Label? No Problem: Unsupervised Continual Learning for Adaptive Medical ASR
Meizhu Liu | Tao Sheng

Automatic Speech Recognition (ASR) plays an important role in healthcare but faces unique challenges. Medical audio often contains specialized terminology, such as medication names, which existing ASR systems struggle to transcribe accurately. High error rates arise from pronunciation variability, the continual introduction of new terms, and the scarcity of high-quality labeled data—whose collection is costly and requires medical expertise. Although synthetic datasets partially alleviate this problem, they fail to capture the noise and variability of real-world recordings. Moreover, ASR models trained in controlled environments are highly sensitive to noise, leading to degraded performance in clinical settings. To address these limitations, we propose an unsupervised continual learning ASR framework that adapts to new data while preserving prior knowledge. This enables efficient domain adaptation without extensive retraining. Experiments on real-world medical audio demonstrate significant improvements over state-of-the-art baselines.

pdf bib abs

EduPulse: A Practical LLM-Enhanced Opinion Mining System for Vietnamese Student Feedback in Educational Platforms
Nguyen Xuan Phuc | Phi Nguyen Xuan | Vinh-Tiep Nguyen | Thìn Đặng Văn | Ngan Luu-Thuy Nguyen

Opinion mining from real-world student feedback presents significant practical challenges, such as handling linguistic noise (slang, teencode) and the need for scalable and maintainable systems, which are often overlooked in academic research. This paper introduces EduPulse, a practical opinion mining system designed specifically to analyze student feedback in Vietnamese. Our application performs four opinion analysis tasks, including Sentiment Classification, Category-based Sentiment Classification, Suggestion Detection, and Opinion Summarization. We design the hybrid architecture that strategically balances performance, cost, and maintainability. This architecture leverages the robustness of Large Language Models (LLMs) for complex, noise-sensitive tasks as sentiment classification and suggestion detection, while employing a specialized, lightweight neural model for high-throughput, low-cost solutions. Our experiments show that applying the LLM-based approach achieves high robustness, justifying its operational cost by eliminating expensive retraining cycles. Furthermore, we demonstrate that our collaborative modular architecture significantly improves task performance (+7.6%) compared to traditional approaches, offering a practical design for industry-focused Natural Language Processing applications.

pdf bib abs

When Speed Meets Intelligence: Scalable Conversational NER in an Ever-evolving World
Karim Ghonim | Antonio Roberto | Davide Bernardi

Modern conversational AI systems require sophisticated Named Entity Recognition (NER) capabilities that can handle complex, contextual dialogue patterns. While Large Language Models (LLMs) excel at understanding conversational semantics, their inference latency and inability to efficiently incorporate emerging entities make them impractical for production deployment. Moreover, the scarcity of conversational NER data creates a critical bottleneck for developing effective models.We address these challenges through two main contributions. First, we introduce an automated pipeline for generating multilingual conversational NER datasets with minimal human validation, producing 4,082 English and 3,925 Spanish utterances. Second, we present a scalable framework that leverages LLMs as semantic filters combined with catalog-based entity grounding to label live traffic data, enabling knowledge distillation into faster, production-ready models. On internal conversational datasets, our teacher model demonstrates 39.55% relative F1-score improvement in English and 44.93% in Spanish compared to production systems. On public benchmarks, we achieve 97.12% F1-score on CoNLL-2003 and 83.09% on OntoNotes 5.0, outperforming prior state-of-the-art by 24.82 and 8.19 percentage points, respectively. Finally, student models distilled from our teacher approach achieve 13.84% relative improvement on English conversational data, bridging the gap between LLM capabilities and real-world deployment constraints.

pdf bib abs

ReflectiveRAG: Rethinking Adaptivity in Retrieval-Augmented Generation
Akshay Verma | Swapnil Gupta | Siddharth Pillai | Prateek Sircar | Deepak Gupta

Retrieval-Augmented Generation (RAG) systems degrade sharply under extreme noise,where irrelevant or redundant passages dominate. Current methods-fixed top-k retrieval, cross-encoder reranking, or policy based iteration-depend on static heuristics orcostly reinforcement learning, failing to assess evidence sufficiency, detect subtle mismatches, or reduce redundancy, leading to hallucinations and poor grounding. We introduce ReflectiveRAG, a lightweight yet reasoning-driven architecture that enhances factual grounding through two complementary mechanisms: Self-Reflective Retrieval (SRR) and Contrastive Noise Removal (NR). SRR employs small language model as a decision controller that iteratively evaluates evidence sufficiency, enabling adaptive query reformulation withoutfixed schedules or policy training. NR further refines retrieved content via embedding-based contrastive filtering, enforcing semanticsparsity and removing redundant or tangential passages. Evaluated on WebQuestions, HotpotQA (distractor setting) and InternalQAwith 50M Common Crawl distractors, ReflectiveRAG achieves substantial gains over strong baselines-including DeepRAG-improving EMby +2.7 pp and F1 by +2.5 pp, while reducing evidence redundancy by 30.88% with only 18 ms additional latency. Ablation studies con-firm that SRR and NR jointly drive both factual accuracy and efficiency, validating our central claim that retrieval reasoning and contrastivefiltering can outperform large-scale policy optimization in RAG.

pdf bib abs

OCR or Not? Rethinking Document Information Extraction in the MLLMs Era with Real-World Large-Scale Datasets
Jiyuan Shen | Yuan Peiyue | Atin Ghosh | Yifan Mai | Daniel Dahlmeier

Multimodal Large Language Models (MLLMs) enhance the potential of natural language processing. However, their actual impact on document information extraction remains unclear. In particular, it is unclear whether an MLLM-only pipeline—while simpler—can truly match the performance of traditional OCR+MLLM setups. In this paper, we conduct a large-scale benchmarking study that evaluates various out-of-the-box MLLMs on business-document information extraction. To examine and explore failure modes, we propose an automated hierarchical error analysis framework that leverages large language models (LLMs) to diagnose error patterns systematically. Our findings suggest that OCR may not be necessary for powerful MLLMs, as image-only input can achieve comparable performance to OCR-enhanced approaches. Moreover, we demonstrate that carefully designed schema, exemplars, and instructions can further enhance MLLMs performance. We hope this work can offer practical guidance and valuable insight for advancing document information extraction.

pdf bib abs

PatentVision: A multimodal method for drafting patent applications
Ruo Yang | Sai Krishna Reddy Mudhiganti | Manali Sharma

Patent drafting is complex due to its need for detailed technical descriptions, legal compliance, and visual elements. Although Large Vision-Language Models (LVLMs) show promise across various tasks, their application in automating patent writing remains underexplored. In this paper, we present PatentVision, a multimodal framework that integrates textual and visual inputs—such as patent claims and drawings—to generate complete patent specifications. Built on advanced LVLMs, PatentVision enhances accuracy by combining fine-tuned vision-language models with domain-specific training tailored to patents. Experiments reveal it surpasses text-only methods, producing outputs with greater fidelity and alignment with human-written standards. Its incorporation of visual data allows it to better represent intricate design features and functional connections, leading to richer and more precise results. This study underscores the value of multimodal techniques in patent automation, providing a scalable tool to reduce manual workloads and improve consistency. PatentVision not only advances patent drafting but also lays groundwork for broader use of LVLMs in specialized areas, potentially transforming intellectual property management and innovation processes.

pdf bib abs

Multimodal Large Language Models (MLLMs) struggle with Long Video Understanding (LVU) due to their limited context window and the distributed nature of salient information across many redundant frames. To address this, we present VideoMind, a novel training free framework for LVU designed to mimic a human reasoning process. The framework is orchestrated by an MLLM that breaks down a user’s query into a series of simpler, actionable sub-queries. For each sub query, the MLLM reconfigures itself by invoking specialized ‘modes’ that are instantiations of the same MLLM, but with appropriately tailored context for the given sub query to extract targeted evidence. After gathering this evidence, the model resumes its role as the orchestrator which evaluates the results and decides if an answer is complete or if it must refine its strategy by engaging further modes with new context. Our specialized operational modes include: 1) a Multi-Scale Temporal Search mode to identify and summarize relevant video sub-snippets at varying time scales, and 2) a Single-Frame Visual Detail mode for precise spatial localization of objects. This dynamic allocation of computation yields state-of-the-art results on the Video-MME, LongVideo, and MLVU benchmarks, achieving 77.6% performance on Video MME using Qwen 2.5 72B (4.8% enhancement) while also yielding a 5% improvement on Llama 4 Scout.

pdf bib abs

RegNLI: Detecting Online Product Misbranding through Legal and Linguistic Alignment
Diya Saha | Abhishek Bharadwaj Varanasi | Tirthankar Dasgupta | Manjira Sinha

Misbranding of health-related products poses significant risks to public safety and regulatory compliance. Existing approaches to claim verification largely rely on keyword matching or generic text classification, failing to capture the nuanced reasoning required to align product claims with legal statutes. In this work, we introduce RegNLI, a novel framework that formulates misbranding detection as a inference task between product claims and regulatory provisions. Leveraging a curated dataset of FDA warning letters, we construct structured representations of claims and statutes. Our model integrates a regulation-aware gating mechanism with a contrastive alignment objective to jointly optimize misbranding classification and statute mapping. Experiments on the FDA-Misbrand dataset demonstrate that RegNLI significantly outperforms strong baselines across accuracy, F1-score, and regulation alignment metrics, while providing interpretable attention patterns that highlight critical linguistic cues. This work establishes a foundation for compliance-aware NLP systems and opens new directions for integrating formal reasoning with neural architectures in regulatory domains.

pdf bib abs

CASPER: Bridging Discrete and Continuous Prompt Optimization through Feedback-Guided Gradient Descent
Aryan Jain | Pushpendu Ghosh | Promod Yenigalla

Workflow automation is critical for reducing manual efforts in industries, yet existing pipelines fail to handle generative tasks like summarization and extraction without pre-built tools, forcing human intervention. While LLM-based agents offer solutions, their creation depends heavily on prompt engineering—a resource-intensive process often yielding suboptimal results. Current automated approaches face a fundamental trade-off: discrete optimization produces overfitted prompts without convergence guarantees due to non-convex landscapes, while continuous gradient-based methods generate semantically incoherent prompts through embedding optimization. We propose CASPER, a framework bridging discrete and continuous prompt optimization through feedback-guided gradient descent in embedding space. CASPER employs a feedback module producing detailed error analyses that capture failure modes as optimization signals. These insights are projected with prompt tokens into embedding space to steer gradient descent. To preserve interpretability, we incorporate fluency regularization that penalizes incomprehensible tokens. We further accelerate convergence through synthetic data generation that oversamples failure cases, while also addressing data scarcity in industrial settings. We evaluate CASPER on WDC, DROP, GSM8K with F1 improvements of 2.3%, 1.6%, 2.3% and VQA, internal benchmarks showing accuracy improvements of 1.1%, 3%, demonstrating cross-domain generalizability.

pdf bib abs

Enterprise AI agents must continuously adapt to maintain accuracy, reduce latency, and remain aligned with user needs. We present a practical implementation of a data flywheel in NVInfo AI, NVIDIA’s Mixture-of-Experts (MoE) Knowledge Assistant serving over 30,000 employees. By operationalizing a MAPE-driven data flywheel, we built a closed-loop system that systematically addresses failures in retrieval-augmented generation (RAG) pipelines and enables continuous learning.Over a 3-month post-deployment period, we monitored feedback and collected 495 negative samples. Analysis revealed two major failure modes: routing errors (5.25%) and query rephrasal errors (3.2%). Using NVIDIA NeMo Microservices, we implemented targeted improvements through fine-tuning. For routing, we replaced a Llama 3.1 70B model with a fine-tuned 8B variant, achieving 96% accuracy, a 10× reduction in model size, and 70% latency improvement. For query rephrasal, fine-tuning yielded a 3.7% gain in accuracy and a 40% latency reduction.Our approach demonstrates how human-in-the-loop (HITL) feedback, when structured within a data flywheel, transforms enterprise AI agents into self-improving systems. Key learnings include approaches to ensure agent robustness despite limited user feedback, navigating privacy constraints, and executing staged rollouts in production. This work offers a repeatable blueprint for building robust, adaptive enterprise AI agents capable of learning from real-world usage at scale.

pdf bib abs

Medical Summarization in Practice: Design, Deployment, and Analysis of a Clinical Summarization System for a German Hospital
Moiz Rauf | Sean Papay

Through the course of hospital treatment, a large number of electronic health records (EHRs) are created for a patient, detailingaspects of care history such as lab results, physician notes, and treatments administered.At the conclusion of treatment, this collection of EHRs must be summarized into a discharge summary,describing the course or care clearly and cohesively.In this paper, we present the design and development of a clinical summarization system integrated into a live German hospital workflowto help with the generation of discharge summaries.We first describe the system, its components, and its context of use within a hospital,before performing a number of experiments to gain insights into how best to use and evaluate our system.We investigate summarization performance across multiple input encoding strategies, compare expert judgments against automatic evaluation of summaries,and analyze the consistency of model summaries across multiple text generations.This work not only acts as a case study to demonstrate the feasibility of LLM integration into healthcare infrastructure,but also provides actionable insights into the use and evaluation of such systems.

pdf bib abs

Feedback-Aware Prompt Optimization Framework for Generating Job Postings
Suraj Maharjan | Ainur Yessenalina | Srinivasan H. Sengamedu

Job postings are critical for recruitment, yet large enterprises struggle with standardization and consistency, requiring significant time from hiring managers and recruiters. We present a feedback-aware prompt optimization framework that automates high-quality job posting generation through iterative human-in-the-loop refinement. Our system integrates multiple data sources: job metadata, competencies, organization’s compliance guidelines, and organization brand statement, while incorporating human feedback to continuously improve prompt quality through multi-LLM validation. We evaluated our approach using LLM-as-a-judge on 1,056 job postings and human evaluation on a smaller subset across three dimensions: Standardization, Compliance, and User Perception. Our results demonstrate high compliance rates and strong satisfaction scores in both automated and human evaluation, validating the effectiveness of our feedback-aware approach for enterprise job posting generation.

pdf bib abs

The proliferation of multi-modal online advertisements necessitates robust content moderation to ensure user safety, as offensive ad content can cause user distress and erode platform trust. This paper addresses the detection of content that becomes offensive only when a user’s search query is paired with a specific ad, a context-dependent challenge that simple moderation often misses. Key challenges include the nuanced, multi-modal nature of ads, severe data scarcity and class imbalance due to the rarity of offensive content, and the high cost of human labeling. To overcome these limitations, we introduce a novel, context-aware detection framework centered on a large-scale, Multi-modal Teacher-Student Knowledge Distillation architecture. A powerful Gemini encoder-only “teacher” model distills its knowledge into a lightweight student model suitable for low-latency deployment. We enhance robustness using a novel graph mining technique to find rare offensive examples for training. For evaluation, we developed a highly accurate Automated Evaluation Model (AEM)—a separate, larger Gemini model utilizing Chain-of-Thought (CoT) reasoning—to rigorously assess performance in a live A/B test. Our results demonstrate that the proposed framework reduces the serving of offensive query-ad pairs by more than 80% compared to the baseline, while maintaining the efficiency required for real-time advertising systems that operate at a scale of over ≈100 billion query-ad pairs per day. Disclaimer: This paper contains sentences and images that may be offensive. These examples are included solely for scientific analysis and do not reflect the views of the authors.

pdf bib abs

SAGE: An Agentic Explainer Framework for Interpreting SAE Features in Language Models
Jiaojiao Han | Wujiang Xu | Mingyu Jin | Mengnan Du

Large language models (LLMs) have achieved remarkable progress, yet their internal mechanisms remain largely opaque, posing a significant challenge to their safe and reliable deployment. Sparse autoencoders (SAEs) have emerged as a promising tool for decomposing LLM representations into more interpretable features, but explaining the features captured by SAEs remains a challenging task. In this work, we propose SAGE (SAE Agentic Explainer), an agent-based framework that recasts feature interpretation from a passive, single-pass generation task into an active, explanation-driven process. SAGE implements a rigorous methodology by systematically formulating multiple explanations for each feature, designing targeted experiments to test them, and iteratively refining explanations based on empirical activation feedback. Experiments on features from SAEs of diverse language models demonstrate that SAGE produces explanations with significantly higher generative and predictive accuracy compared to state-of-the-art baselines.

pdf bib abs

E-commerce product understanding demands by nature, strong multimodal comprehension from text, images, and structured attributes. General-purpose Vision–Language Models (VLMs) enable generalizable multimodal latent modelling, yet there is no documented, well-known strategy for adapting them to the attribute-centric, multi-image, and noisy nature of e-commerce data, without sacrificing general performance. In this work, we show through a large-scale experimental study, how targeted adaptation of general VLMs can substantially improve e-commerce performance while preserving broad multimodal capabilities. Furthermore, we propose a novel extensive evaluation suite covering deep product understanding, strict instruction following, and dynamic attribute extraction.

pdf bib abs

As the performance of large language models (LLMs) continues to advance, their adoption in the medical domain is increasing. However, most existing risk evaluations largely focused on general safety benchmarks. In the medical applications, LLMs may be used by a wide range of users, ranging from general users and patients to clinicians, with diverse levels of expertise and the model’s outputs can have a direct impact on human health which raises serious safety concerns. In this paper, we introduce MedRiskEval, a medical risk evaluation benchmark tailored to the medical domain. To fill the gap in previous benchmarks that only focused on the clinician perspective, we introduce a new patient-oriented dataset called PatientSafetyBench containing 466 samples across 5 critical risk categories. Leveraging our new benchmark alongside existing datasets, we evaluate a variety of open- and closed-source LLMs. To the best of our knowledge, this work establishes an initial foundation for safer deployment of LLMs in healthcare.

pdf bib abs

Synthetic Doctor-Patient Dialogue Generation for Robust Medical ASR: A Scalable Pipeline for Vocabulary Expansion and Privacy Preservation
Kefei Liu | Meizhu Liu

Automatic Speech Recognition (ASR) is increasingly integral to healthcare services, where medical conversations present unique transcription challenges due to specialized terminology and frequent introduction of new terms. Existing ASR models, including widely used systems like Whisper, struggle with high word error rates (WER) on clinical vocabulary, especially medication names, primarily due to the scarcity of annotated audio-transcript data in the medical domain. This paper proposes and evaluates a novel synthetic data generation pipeline that produces comprehensive doctor-patient dialogues in both text and audio forms, specifically targeting a curated set of over 124,000 medical terms. The pipeline generated over 1 billion audios with ground truth transcriptions. Fine-tuning ASR models with this synthetic corpus significantly reduced overall WER and improved transcription accuracy on medical terms, marking a significant advance in healthcare ASR accuracy. Data generation code, dataset, and training and evaluation scripts are released.

pdf bib abs

Summarization of multi-party dialogues is a critical capability in industry, enhancing knowledge transfer and operational effectiveness across many domains. However, automatically generating high-quality summaries is challenging, as the ideal summary must satisfy a set of complex, multi-faceted requirements. While summarization has received immense attention in research, prior work has primarily utilized static datasets and benchmarks, a condition rare in practical scenarios where requirements inevitably evolve. In this work, we present an industry case study on developing an agentic system to summarize multi-party interactions. We share practical insights spanning the full development lifecycle to guide practitioners in building reliable, adaptable summarization systems, as well as to inform future research, covering: 1) robust methods for evaluation despite evolving requirements and task subjectivity, 2) component-wise optimization enabled by the task decomposition inherent in an agentic architecture, 3) the impact of upstream data bottlenecks, and 4) the realities of vendor lock-in due to the poor transferability of LLM prompts.

pdf bib abs

LingVarBench: Benchmarking LLMs on Entity Recognitions and Linguistic Verbalization Patterns in Phone-Call Transcripts
Seyedali Mohammadi | Manas Paldhe | Amit Chhabra | Youngseo Son | Vishal Seshagiri

We study structured entity extraction from phone-call transcripts in customer-support and healthcare settings, where annotation is costly, and data access is limited by privacy and consent. Existing methods degrade under disfluencies, interruptions, and speaker overlap, yet large real-call corpora are rarely shareable. We introduce LingVarBench, a benchmark and semantic synthetic data generation pipeline that generates linguistically varied training data via (1) LLM-sampled entity values, (2) curated linguistic verbalization patterns covering diverse disfluencies and entity-specific readout styles, and (3) a value–transcript consistency filter. Using this dataset, DSPy’s SIMBA automatically synthesizes and optimizes extraction prompts, reducing manual prompt engineering and targeting robustness to verbal variation. On real customer transcripts, prompts optimized solely on LingVarBench outperform zero-shot baselines and match or closely approach human-tuned prompts for structured entities such as ZIP code, date of birth, and name (F1 approximately 94-95 percent). For subjective questionnaire items, optimized prompts substantially improve over zero-shot performance and approach human-tuned prompts. LingVarBench offers a practical and cost-efficient path to deployment in a direct-answer setting, with real annotations later enabling additional refinement.

pdf bib abs

Improving Training Efficiency and Reducing Maintenance Costs via Language Specific Model Merging
Alphaeus Dmonte | Vidhi Gupta | Daniel J Perry | Mark Arehart

Fine-tuning a task-specific multilingual large language model (LLM) involves training the model on a multilingual dataset with examples in all the required languages. Updating one or more supported languages with additional data or adding support for a new language involves retraining the model, which can be computationally inefficient and creates a severe maintenance bottleneck. Recent research on merging multilingual multitask models has shown promise in terms of improved quality, but its computational and maintenance efficiency remains unstudied. In this work, we provide the first focused analysis of this merging strategy from an efficiency perspective, evaluating it across three independent tasks. We demonstrate significant efficiency gains while maintaining parity in terms of quality: this merging approach reduces the initial training time by up to 50%. We also demonstrate that updating an individual language and re-merging as part of model maintenance reduces training costs by more than 60%, compared to re-training the full multilingual model. We show this on both public and proprietary industry datasets confirming that the approach works well for industrial use cases in addition to academic settings already studied in previous work.

pdf bib abs

This paper introduces a novel framework for simulating and analyzing how uncooperative behaviors can destabilize or collapse LLM-based multi-agent systems. Our framework includes two key components: (1) a game theory-based taxonomy of uncooperative agent behaviors, addressing a notable gap in the existing literature; and (2) a structured, multi-stage simulation pipeline that dynamically generates and refines uncooperative behaviors as agents’ states evolve. We evaluate the framework via a collaborative resource management setting, measuring system stability using metrics such as survival time and resource overuse rate. Empirically, our framework achieves ~96.7% accuracy in generating realistic uncooperative behaviors, validated by human evaluations. Our results reveal a striking contrast: cooperative agents maintain perfect system stability (100% survival over 12 rounds with 0% resource overuse), while any uncooperative behavior can trigger rapid system collapse within 1–7 rounds. We also evaluate LLM-based defense methods, finding they detect some uncooperative behaviors, but some behaviors remain largely undetectable. These gaps highlight how uncooperative agents degrade collective outcomes and underscore the need for more resilient multi-agent systems.

pdf bib abs

Social media platforms have become primary sources for news consumption due to their real-time and interactive nature, yet they have also facilitated the widespread proliferation of misinformation, negatively impacting public health, social cohesion, and market stability. While professional fact-checking is essential for debunking rumors, the process is time-consuming, necessitating automation to effectively combat fake news. Existing approaches, such as extractive methods, often lack coherence and context, whereas abstractive methods leveraging large language models (LLMs) can generate more readable and informative debunking passages. However, readability alone is insufficient for effective misinformation correction; user acceptance is critical. Recent advancements in LLMs offer new opportunities for personalized debunking, as these models can generate context-sensitive responses and adapt to user profiles. Building on this, we propose the MUti-round Refinement and Simulated fEedback-enhanced framework (MURSE), which generates Chinese user-specific debunking passages by iteratively refining outputs based on simulated user feedback. Specifically, MURSE-generated user-specific debunking passages were preferred twice as often as general debunking passages in most cases, highlighting its potential to improve misinformation correction and foster positive dissemination chains.

pdf bib abs

Synthetic Data Fine-Tuning for Effective Team Formation in Enterprises
Guilherme Drummond Lima | Adriano Veloso

We evaluate the effectiveness of synthetic data fine-tuning for Semantic Search in a real-world Enterprise Team Formation problem scenario. In this problem, we aim to retrieve the best employee for a given task, given their information regarding abilities, experiences, and other aspects. We evaluate two synthetic data generation strategies: (1) augmenting real-world data with synthetic labels and (2) generating synthetic profiles for employees tailored to specific tasks. To measure the impact of these strategies, we fine-tune a pretrained text embedding model using LoRA and Rank Aggregation techniques. We evaluate the model performance against current SOTA algorithms on a human-curated dataset. Our experiments indicate that training a model that uses a combination of both Synthetic data generation strategies outperforms already established pre-trained models on the Team Formation task, improving the ranking metrics by an average of 30% in comparison to the best-performing pre-trained model.

pdf bib abs

Assertion-Conditioned Compliance: A Provenance-Aware Vulnerability in Multi-Turn Tool-Calling Agents
Daud Waqas | Aaryamaan Golthi | Erika Hayashida | Huanzhi Mao

Multi-turn tool-calling LLMs — models capable of invoking external APIs or tools across several user turns — have emerged as a key feature in modern AI assistants, enabling extended dialogues from benign tasks to critical business, medical, and financial operations. Yet implementing multi-turn pipelines *remains difficult for many safety-critical industries* due to ongoing concerns regarding model resilience. While standardized benchmarks, such as the Berkeley Function-Calling Leaderboard (BFCL), have underpinned confidence concerning advanced function-calling models (like Salesforce’s xLAM V2), there is still a lack of visibility into multi-turn conversation-level robustness, especially given their exposure to real-world systems. In this paper, we introduce **Assertion-Conditioned Compliance (A-CC)**, a novel evaluation paradigm for multi-turn function-calling dialogues. A-CC provides holistic metrics that evaluate a model’s behavior when confronted with misleading assertions originating from two distinct vectors: (1) user-sourced assertions (USAs), which measure sycophancy toward plausible but misinformed user beliefs, and (2) function-sourced assertions (FSAs), which measure compliance with plausible but contradictory system policies (e.g., stale hints from unmaintained tools). Our results show that models are highly vulnerable to both USA sycophancy and FSA policy conflicts, confirming A-CC as a critical, latent vulnerability in deployed agents.

pdf bib abs

PROBES : Performance and Relevance Observation for BEtter Search
Sejal Jain | Cyrus Andre DSouza | Jitenkumar Babubhai Rana | Aniket Joshi | Promod Yenigalla

High-quality search is essential for the success of online platforms, spanning e-commerce, social media, shopping-focused applications, and broader search systems such as content discovery and enterprise web search. To ensure optimal user experience and drive business growth, continuous evaluation and improvement of search systems is crucial. This paper introduces PROBES, a novel multi-task system powered by Large Language Models (LLMs) designed for end-to-end evaluation of semantic search systems. PROBES identifies context-aware relevance using a fine-grained scale (exact, substitute, complement, irrelevant) by leveraging the query category, feature-level intent, and category-aware feature importance, enabling more precise and consistent judgments than relying solely on raw query text. This allows PROBES to provide differentiated relevance assessment across a diverse range of query categories. PROBES then dives deeper to understand the reason behind irrelevant results (Precision issues) by checking product content conflicts and inaccuracies. It also analyzes Missed Recall by leveraging retrieval and relevance models to determine whether a missed recall was due to a selection issue or a ranking/retrieval system issue. To evaluate PROBES, we introduce a new metric, the Actionable Error Rate (AER), defined as the proportion of actionable errors over all flagged errors. We observe that PROBES operates at an AER of 76%, generating actionable insights across 100 product categories.

Speech large language models (LLMs) observe paralinguistic cues such as prosody, emotion, and non-verbal sounds—crucial for intent understanding. However, leveraging these cues faces challenges: limited training data, annotation difficulty, and models exploiting lexical shortcuts over paralinguistic signals. We propose multi-task reinforcement learning (RL) with chain-of-thought prompting that elicits explicit affective reasoning. To address data scarcity, we introduce a paralinguistics-aware speech LLM (PALLM) that jointly optimizes sentiment classification from audio and paralinguistics-aware response generation via a two-stage pipeline. Experiments demonstrate that our approach improves paralinguistics understanding over both supervised baselines and strong proprietary models (Gemini-2.5-Pro, GPT-4o-audio), by 8-12% on Expresso, IEMOCAP, and RAVDESS. The results show that modeling paralinguistic reasoning with multi-task RL is crucial for building emotionally intelligent speech LLMs.

pdf bib abs

IndicJR: A Judge-Free Benchmark of Jailbreak Robustness in South Asian Languages
Priyaranjan Pattnayak | Sanchari Chowdhuri

Safety alignment of large language models (LLMs) is mostly evaluated in English and contract-bound, leaving multilingual vulnerabilities understudied. We introduce Indic Jailbreak Robustness (IJR) a judge-free benchmark for adversarial safety across 12 Indic and South Asian languages (~2.09B speakers), covering 45,216 prompts in JSON (contract-bound) and Free (naturalistic) tracks.IJR reveals three patterns. (1) Contracts inflate refusals but do not stop jailbreaks: in JSON, LLaMA and Sarvam exceed 0.92 JSR, and in Free all models reach ~1.0 with refusals collapsing. (2) English→Indic attacks transfer strongly, with format wrappers often outperforming instruction wrappers. (3) Orthography matters: romanized/mixed inputs reduce JSR under JSON, with correlations to romanization share and tokenization ρ ≈ 0.28–0.32 indicating systematic effects. Human audits confirm detector reliability, and lite-to-full comparisons preserve conclusions. IJR offers a reproducible multilingual stress test revealing risks hidden by English-only, contract-focused evaluations, especially for South Asian users who frequently code-switch and romanize.

pdf bib abs

Synthesizing question answering data from financial documents: An End-to-End Multi-Agent Approach
Chetan Harsha | Karmvir Singh Phogat | Sridhar Dasaratha | Shashishekar Ramakrishna

Answering complex questions that require numerical reasoning over financial documents is challenging due to the diverse and scatterednature of relevant information. While large language models (LLMs) excel at financial reasoning, their enterprise deployment is often limited by cost and latency. Small language models (SLMs) present a cost-effective alternative but need to be fine-tuned with high-quality, domain-specific question-answer (QA) data. Acquiring such data requires manual expert annotation, presenting a bottleneck to the wider application of SLMs.This work introduces a modular, scalable end-to-end agentic pipeline that extracts and selects relevant content from unstructured financial documents and then generates QA pairs from the selected content for SLM fine-tuning. Compared to the same models trained on previous manually generated data for the task, one of the models trained on our pipeline-produced synthetic data achieved competitive in-distribution performance, and all tested models demonstrated superior generalization. The framework thus demonstrates considerable potential to accelerate the deployment of smaller, cost-effective models by reducing manual data creation efforts.

pdf bib abs

Toward Automatic Delegation Extraction in Japanese Law
Tsuyoshi Fujita | Yuya Sawada | Yusuke Sakai | Taro Watanabe

The legal systems have a hierarchical structure, and a higher-level law often authorizes a lower-level law to implement detailed provisions, which is called delegation. When interpreting legal texts with delegation, readers must repeatedly consult the lower-level laws that stipulate the detailed provisions, imposing a substantial workload. Therefore, it is necessary to develop a system that enables readers to instantly refer to relevant laws in delegation. However, manually annotating delegation is difficult because it requires extensive legal expertise, careful reading of numerous legal texts, and continuous adaptation to newly enacted laws. In this study, we focus on Japanese law and develop a two-stage pipeline system for automatic delegation annotation. First, we extract keywords that indicate delegation using a named entity recognition approach. Second, we identify the delegated provision corresponding to each keyword as an entity disambiguation task. In our experiments, the proposed system demonstrates sufficient performance to assist manual annotation in practice.

pdf bib abs

DIALECTIC: A Multi-Agent System for Startup Evaluation
Jae Yoon Bae | Simon Malberg | Joyce Ann Clarize Galang | Andre Retterath | Georg Groh

Venture capital (VC) investors face a large number of investment opportunities but only invest in few of these, with even fewer ending up successful. Early-stage screening of opportunities is often limited by investor bandwidth, demanding tradeoffs between evaluation diligence and number of opportunities assessed. To ease this tradeoff, we introduce DIALECTIC, an LLM-based multi-agent system for startup evaluation. DIALECTIC first gathers factual knowledge about a startup and organizes these facts into a hierarchical question tree. It then synthesizes the facts into natural-language arguments for and against an investment and iteratively critiques and refines these arguments through a simulated debate, which surfaces only the most convincing arguments. Our system also produces numeric decision scores that allow investors to rank and thus efficiently prioritize opportunities. We evaluate DIALECTIC through backtesting on real investment opportunities aggregated from five VC funds, showing that DIALECTIC matches the precision of human VCs in predicting startup success.

pdf bib abs

Legal documents have complex document layouts involving multiple nested sections, lengthy footnotes and further use specialized linguistic devices like intricate syntax and domain-specific vocabulary to ensure precision and authority. These inherent characteristics of legal documents make question answering challenging, and particularly so when the answer to the question spans several pages (i.e. requires long-context) and is required to be comprehensive (i.e. a long-form answer).In this paper, we address the challenges of long-context question answering in context of long-form answers given the idiosyncrasies of legal documents. We propose a question answering system that can (a) deconstruct domain-specific vocabulary for better retrieval from source documents, (b) parse complex document layouts while isolating sections and footnotes and linking them appropriately, (c) generate comprehensive answers using precise domain-specific vocabulary. We also introduce a coverage metric that classifies the performance into recall-based coverage categories allowing human users to evaluate the recall with ease. By leveraging the expertise of professionals from fields such as law and corporate tax, we curate a QA dataset. Through comprehensive experiments and ablation studies, we demonstrate the usability and merit of the proposed system.

pdf bib abs

We propose an efficient layer-specific optimization (ELO) method designed to enhance continual pretraining (CP) for specific languages in multilingual large language models (MLLMs). This approach addresses the common challenges of high computational cost and degradation of source language performance associated with traditional CP. The ELO method consists of two main stages: (1) ELO Pretraining, where a small subset of specific layers, identified in our experiments as the critically important first and last layers, are detached from the original MLLM and trained with the target language. This significantly reduces not only the number of trainable parameters but also the total parameters computed during the forward pass, minimizing GPU memory consumption and accelerating the training process. (2) Layer Alignment, where the newly trained layers are reintegrated into the original model, followed by a brief full fine-tuning step on a small dataset to align the parameters. Experimental results demonstrate that the ELO method achieves a training speedup of up to 6.46 times compared to existing methods, while improving target language performance by up to 6.2% on qualitative benchmarks and effectively preserving source language (English) capabilities.

pdf bib abs

MIRAGE: Metadata-guided Image Retrieval and Answer Generation for E-commerce Troubleshooting
Rishav Sahay | Lavanya Sita Tekumalla | Anoop Saladi

Existing multimodal systems typically associate text and available images based on embedding similarity or simple co-location, but such approaches often fail to ensure that the linked image accurately depicts the specific product or component mentioned in a troubleshooting instruction. We introduce MIRAGE, a metadata-first paradigm that treats structured metadata, (not raw pixels), as a first-class modality for multimodal grounding. In MIRAGE, both text and images are projected through a shared semantic schema capturing product attributes, context, and visual aspects, enabling reasoning over interpretable attributes for troubleshooting rather than unstructured embeddings. MIRAGE comprises of three complementary modules: M-Link for schema-guided image–text linking, M-Gen for metadata-conditioned multimodal generation, and M-Eval for consistency evaluation in the same structured space. Experiments on large-scale enterprise e-commerce troubleshooting data across 10 product types on 100K text chunks and 35K images show that metadata-centric grounding achieves over 40% higher linking coverage of high-quality visual content and over 45% in linking and response quality than embedding-based baselines. MIRAGE demonstrates the potential of structured metadata in enabling scalable, fine-grained grounding in multimodal troubleshooting systems.

pdf bib abs

Optimizing Register Transfer Level (RTL) code is a critical step in Electronic Design Automation (EDA) for improving power, performance, and area (PPA). We present CODMAS (Collaborative Optimization via a Dialectic Multi-Agent System), a framework that combines structured dialectic reasoning with domain-aware code generation and deterministic evaluation to automate RTL optimization. At the core of CODMAS are two dialectic agents: the Articulator, inspired by rubber-duck debugging, which articulates stepwise transformation plans and exposes latent assumptions; and the Hypothesis Partner, which predicts outcomes and reconciles deviations between expected and actual behavior to guide targeted refinements. These agents direct a Domain-Specific Coding Agent (DCA) to generate architecture-aware Verilog edits and a Code Evaluation Agent (CEA) to verify syntax, functionality, and PPA metrics. We introduce RTLOPT, a benchmark of 120 Verilog triples (unoptimized, optimized, testbench) for pipelining and clock-gating transformations. Across proprietary and open LLMs, CODMAS achieves ~25% reduction in critical path delay for pipelining and ~22% power reduction for clock gating, while reducing functional and compilation failures compared to strong prompting and agentic baselines. These results demonstrate that structured multi-agent reasoning can significantly enhance automated RTL optimization and scale to more complex designs and broader optimization tasks.

pdf bib abs

D3: Dynamic Docid Decoding for Multi-Intent Generative Retrieval
Jaeyoung Kim | Dohyeon Lee | Soona Hong | Seung-won Hwang

Generative Retrieval (GR) maps queries to documents by generating discrete identifiers (DocIDs).However, offline DocID assignment and constrained decoding often prevent GR from capturing query-specific intent, especially when documents express multiple or unseen intents (i.e., intent misalignment).We introduce Dynamic Docid Decoding (D3), an inference-time mechanism that adaptively refines DocIDs through delayed, query-informed identifier expansion.D3 uses (a) verification to detect intent misalignment and (b) dynamic decoding to extend DocIDs with query-aligned tokens, even those absent from the pre-indexed vocabulary, enabling plug-and-play DocID expansion beyond the static vocabulary while adding minimal overhead.Experiments on NQ320k and MS-MARCO show that D3 consistently improves retrieval accuracy, especially on unseen and multi-intent documents, across various GR models, including a +2.4%p nDCG@10 gain on the state-of-the-art model.

pdf bib abs

DisGraph-RP: Graph-Augmented Temporal Modeling with Aspect-Based Contrastive Encoding of Discharge Summary for Readmission Prediction
Sudeshna Jana | Tirthankar Dasgupta | Manjira Sinha | Pabitra Mitra

Predicting hospital readmissions is a critical clinical task with substantial implications for patient outcomes and healthcare cost management. We propose DisGraph-RP, a graph-augmented temporal modeling framework that integrates structured discourse-aware text representation with cross-admission relational reasoning. Our approach introduces a Section-Aware Contrastive Encoder that leverages section segmentation and aspect-based supervision to produce fine-grained representations of discharge summaries. These representations are then composed over time using a Graph-Based temporal module that encodes inter-visit dependencies through learned edge relations, enabling the model to capture disease progression, treatment history, and recurrent risk signals. Experiments on multiple real-world datasets demonstrate that DisGraph-RP achieves significant improvements over strong baselines, including transformer-based clinical models and prompting-based LLM approaches. Our findings highlight the importance of combining discourse-informed text encoding with temporal graph reasoning for robust clinical outcome prediction.

pdf bib abs

CareerPathKG: Knowledge Graph Integrated Framework for Career Intelligence
Ngoc-Quang Le | Duc Duong Hoang | Mai Vu Tran | Thi-Hai-Yen Vuong

The labor market is experiencing rapid and continual shifts in required skills and competencies, driven by technological advancement and evolving industry structures. Within this dynamic environment, candidates increasingly face challenges in orienting their career development, requiring them to continuously update their knowledge and capabilities to meet contemporary job requirements; this need is particularly necessary for new entrants to the labor market, who must cultivate a comprehensive understanding of current labor-market conditions. To address these issues, this study proposes an enterprise recruitment framework grounded in a career path knowledge graph, capturing occupations, skill requirements, and career transitions using standardized taxonomies enriched with job-posting data. The framework integrates transformer-based embeddings, large language models, and knowledge-graph reasoning to support efficient and reliable CV assessment, CV-JD matching and career guidance.

pdf bib abs

A Hybrid Supervised-LLM Pipeline for Actionable Suggestion Mining in Unstructured Customer Reviews
Aakash Trivedi | Aniket Upadhyay | Pratik Narang | Dhruv Kumar | Praveen Kumar

Extracting actionable suggestions from customer reviews is essential for operational decision-making, yet these directives are often embedded within mixed-intent, unstructured text. Existing approaches either classify suggestion-bearing sentences or generate high-level summaries, but rarely isolate the precise improvement instructions businesses need.We evaluate a hybrid pipeline combining a high-recall RoBERTa classifier trained with a precision–recall surrogate to reduce unrecoverable false negatives with a controlled, instruction-tuned LLM for suggestion extraction, categorization, clustering, and summarization. Across real-world hospitality and food datasets, the hybrid system outperforms prompt-only, rule-based, and classifier-only baselines in extraction accuracy and cluster coherence. Human evaluations further confirm that the resulting suggestions and summaries are clear, faithful, and interpretable.Overall, our results show that hybrid reasoning architectures achieve meaningful improvements fine-grained actionable suggestion mining while highlighting challenges in domain adaptation and efficient local deployment.

pdf bib abs

Personalized shopping agents must adapt their decisions to different user personas, balancing efficiency, preference alignment, and goal success. Building upon the WebShop dataset and 𝜏²-Bench environment, ShopperBench introduces a persona-guided benchmark for evaluating such adaptive behaviors. ShopperBench augments shopping trajectories with persona-conditioned goals, reasoning rationales, and preference cues, capturing how diverse shopper types—from price-conscious planners to trend-seeking explorers—navigate product search and selection. We further design a baseline of ShopperAgents that operate under persona guidance to simulate realistic, goal-oriented shopping interactions. To evaluate these agents, we propose new metrics including Persona Fidelity, Persona-Query Alignment, and Path Consistency. Together, Our ShopperBench provides a testbed for studying personalized and context-aware shopping intelligence, bridging the gap between human-centered e-commerce behavior and agent-based simulation.

pdf bib abs

ARQA: A Benchmark for Grounded Table–Text QA in Enterprise Annual Reports
Ruilong Wang | Simone Balloccu

Annual reports communicate corporate performance to stakeholders through dense tables and explanatory text, with rich grounding signals making automated reasoning challenging. Existing QA benchmarks focus on retrieval or single-modality reasoning and rarely require justification for answers with both textual and tabular evidence. We introduce ARQA (Annual Report QA), a benchmark of ~2.5K QA pairs spanning ten fiscal years of automotive enterprise annual reports and three reasoning families — Lookup, Arithmetic, and Insight. Data are produced via a planner–generator pipeline, deterministically verified and recomputed, and fully reviewed by domain experts. We evaluate state-of-the-art instruction-tuned language models on ARQA, showing strong factual retrieval but persistent weaknesses in grounded arithmetic and causal reasoning. We release ARQA and its evaluation toolkit to facilitate research on auditable, evidence-first reasoning over enterprise documents. (https://github.com/RuilongWang/ARQA-Benchmark/)

pdf bib abs

Clinical Question-Answering (CQA) industry systems are increasingly rely on Large Language Models (LLMs), yet their deployment is often guided by the assumption that domain-specific fine-tuning is essential. Although specialised medical LLMs such as BioBERT, BioGPT, and PubMedBERT remain popular, they face practical limitations including narrow coverage, high retraining costs, and limited adaptability. Efforts based on Supervised Fine-Tuning (SFT) have attempted to address these assumptions but continue to reinforce what we term the SPECIALISATION FALLACY—the belief that specialised medical LLMs are inherently superior for CQA. To address this assumption, we introduce MEDASSESS-X, a deployment-industry-oriented CQA framework that applies alignment at inference time rather than through SFT. MEDASSESS-X uses lightweight steering vectors to guide model activations toward medically consistent reasoning without updating model weights or requiring domain-specific retraining. This inference-time alignment layer stabilises CQA performance across both general-purpose and specialised medical LLMs, thereby resolving the SPECIALISATION FALLACY. Empirically, MEDASSESS-X delivers consistent gains across all LLM families, improving Accuracy by up to +6%, Factual Consistency by +7%, and reducing Safety Error Rate by as much as 50%.

pdf bib abs

SkiLLens: Recognising and Mapping Novel Skills from Millions of Job Ads Across Europe Using Language Models
Alessia De Santo | Lorenzo Malandri | Fabio Mercorio | Mario Mezzanzanica | Navid Nobani

In a rapidly evolving labor market, detecting and addressing emerging skill needs is essential for shaping responsive education and workforce policies. Online job advertisements (OJAs) provide a real-time view of changing demands, but require first retrieving skill mentions from unstructured text and then solving the entity linking problem of connecting them to standardized skill taxonomies. To harness this potential, we present a multilingual human-in-the-loop (HITL) pipeline that operates in two steps: candidate skills are extracted from national OJA corpora using country-specific word embeddings, capturing terms that reflect each country’s labor market. These candidates are linked to ESCO using an encoder-based system and refined through a decoder large language models (LLMs) for accurate contextual alignment. Our approach is validated through both quantitative and qualitative evaluations, demonstrating that our method enables timely, multilingual monitoring of emerging skills, supporting agile policy-making and targeted training initiatives.

pdf bib abs

Register-Transfer Level (RTL) synthesis and summarization are central to hardware design automation but remain challenging for Large Language Models (LLMs) due to rigid HDL syntax, limited supervision, and weak alignment with natural language. Existing prompting and retrieval-augmented generation (RAG) methods have not incorporated symbolic planning, limiting their structural precision. We introduce SYMDIREC, a neuro-symbolic framework that decomposes RTL tasks into symbolic subgoals, retrieves relevant code via a fine-tuned retriever, and assembles verified outputs through LLM reasoning. Supporting both Verilog and VHDL without LLM fine-tuning, SYMDIREC achieves ~20% higher Pass@1 rates for synthesis and 15–20% ROUGE-L improvements for summarization over prompting and RAG baselines, demonstrating the benefits of symbolic guidance in RTL tasks.

pdf bib abs

Benchmarking and Mitigating the Impact of Noisy User Prompts in Medical VLMs via Cross-Modal Reflection
Zhiyu Xue | Reza Abbasi-Asl | Ramtin Pedarsani

Medical vision-language models (Med-VLMs) offer a new and effective paradigm for digital health in tasks such as disease diagnosis using clinical images and text. In these tasks, an important but underexplored research question is how Med-VLMs interpret and respond to user-provided clinical information, especially when the prompts are noisy. For a systematic evaluation, we construct Med-CP, a large-scale visual question answering (VQA) benchmark designed to comprehensively evaluate the influence of clinical prompts across diverse modalities, anatomical regions, and diagnostic tasks. Our experiments reveal that existing Med-VLMs tend to follow user-provided prompts blindly, regardless of whether they are accurate or not, raising concerns about their reliability in real-world interactions. To address this problem, we introduce a novel supervised fine-tuning (SFT) approach for Med-VLMs based on cross-modal reflection chain-of-thought (CoT) across medical images and text. In our SFT method, the Med-VLM is trained to produce reasoning paths for the analysis of the medical image and the user-provided prompt. Then, the final answer is determined by conducting a reflection on the visual and textual information. Experimental results demonstrate that our method considerably enhances the robustness against noisy user-provided prompts for both in-domain and out-of-domain evaluation scenarios.

pdf bib abs

Lightweight Domain-Specific Language Model for Real-Time Structuring of Medical Prescriptions
Jonathan Pattin Cottet | Véronique Eglin | Alex Aussem

Automated structuring of medical prescriptions is critical for downstream safety checks in pharmacies, yet remains challenging due to heterogeneous layouts, OCR noise, and dense clinical abbreviations in real-world documents. Existing language models either ignore layout information, rely on computationally expensive image-based architectures, or cannot operate under strict privacy and hardware constraints such as GDPR and HDS-certified environments.We present a lightweight (<10M parameters), privacy-preserving transformer specifically designed for Entity Extraction (EE) and Entity Linking (EL) in French medical prescriptions. The model uses only OCR text and normalized 2D word coordinates, enabling robust pseudonymisation and real-time CPU-level inference while preserving essential spatial cues. It is pretrained on a large corpus of pseudonymised OCR outputs using objectives tailored to prescription structure, including a novel Token-to-Line Alignment (TLA) task, and fine-tuned on the Rx-PAD dataset (Pattin Cottet et al., 2025).Empirical results show that our approach matches or surpasses larger document-understanding models and rivals multimodal LLMs on strict extraction metrics, while achieving sub-second latency suitable for operational deployment. The system is currently used in 230 pharmacies, demonstrating both scalability and practical relevance. These findings highlight the importance of specialized, domain-aware, lightweight models for safe, efficient, and legally compliant prescription verification.

pdf bib abs

Rigorous evaluation of large language models (LLMs) relies on comparing models by the prevalence of desirable or undesirable behaviors, such as task pass rates or policy violations. These prevalence estimates are produced by a classifier, either an LLM-as-a-judge or human annotators, making the choice of classifier central to trustworthy evaluation. Common metrics used for this choice, such as Accuracy, Precision, and F1, are sensitive to class imbalance and to arbitrary choices of positive class, and can favor judges that distort prevalence estimates. We show that Youden’s J statistic is theoretically aligned with choosing the best judge to compare models, and that Balanced Accuracy is an equivalent linear transformation of J. Through both analytical arguments and empirical examples and simulations, we demonstrate how selecting judges using Balanced Accuracy leads to better, more robust classifier selection.

pdf bib abs

PharmaQA.IT: an Italian dataset for Q A in the pharmaceutical domain
Kamyar Zeinalipour | Andrea Zugarini | Asya Zanollo | Leonardo Rigutini

The growing use of Large Language Models (LLMs) for medical Question Answering (QA) requires reliable, evidence-grounded benchmarks beyond English. In Italy, Riassunti delle Caratteristiche del Prodotto (RCP) issued by the Italian Medicines Agency (AIFA) are the main regulatory source on medicines, yet no QA dataset exists on these documents, limiting the development and evaluation of trustworthy Italian QA systems.We introduce PharmaQA.IT, an Italian extractive QA dataset built from RCPs in PharmaER.IT. Using a semi-automatic pipeline, we (i) select informative pages from 1,077 leaflets, (ii) prompt a multimodal LLM on page images with professional personas to generate candidate question–answer pairs, and (iii) validate and normalise them with expert revision. The final dataset contains 861 high-quality question–answer pairs on indications, contraindications, dosage, warnings, interactions, and pharmacological properties.We frame PharmaQA.IT as an extractive QA benchmark with structured JSON outputs and evaluate a range of open and proprietary LLMs. Results show that open models approach closed-source performance under a chunking-and-retrieval setup. PharmaQA.IT, together with all code, prompts, and evaluation scripts, will be publicly released to support research on trustworthy Italian biomedical QA.PharmaQA.IT, together with all code, prompts, and evaluation scripts, is publicly available on Hugging Face to support research on trustworthy Italian biomedical QA.

pdf bib abs

DIRECT: Directional Relevance in Conversational Trajectories
Anshuman Mourya | Rajdeep Mukherjee | Prerna Jolly | Vinayak S Puranik | Sivaramakrishnan R Kaveri

Conversational Agents have become ubiquitous across application domains, such as, shopping assistants, medical diagnosis, autonomous task planning etc. Users interacting with these agents often fail to understand how to start a conversation or what to ask next to obtain the desired information. To enable seamless and hassle-free user-agent interactions, we introduce Next Question Suggestions (NQS), which are essentially highly relevant follow-up question recommendations that act as conversation starters or discover-ability tools to capture non-trivial user intents, leading to more engaging conversations. Relying on LLMs for both response as well as NQS generation is a costly ask in latency-constrained commercial settings, with an added risk of handling potentially unsafe or unanswerable generated queries. A key component of building an efficient low-latency NQS experience is, therefore, retrieval (or embedding) models that fetch the most-relevant candidate questions from an offline pre-curated Question Bank (QB). Off-the-shelf embedding models cannot capture domain-specific nuances and more importantly the directionality inherent in follow-up question recommendations. In this work, we propose an end-to-end retrieval system, DIRECT that is optimized to model directional relevance. Given a user query, it produces a ranked list of highly relevant follow-up question recommendations within 1 sec. Our system also contains an LLM-as-a-judge component, tuned on proprietary user-agent interaction logs, to evaluate the end-to-end performance in terms of CTR.