2025
Group Preference Alignment: Customizing LLM Responses from In-Situ Conversations Only When Needed
Ishani Mondal | Jack W. Stokes | Sujay Kumar Jauhar | Longqi Yang | Mengting Wan | Xiaofeng Xu | Xia Song | Jordan Lee Boyd-Graber | Jennifer Neville
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
LLMs often fail to meet the specialized needs of distinct user groups due to their one-size-fits-all approach, and there is limited understanding of what personalization each group expects. To address this, we propose GPA, a group-aware personalization framework that captures context-specific preference variations and steers LLMs accordingly. Our approach involves: (1) Group-Aware Preference Extraction, which distills divergent preferences from real-world conversation logs into interpretable rubrics, and (2) Tailored Response Generation, using (a) GPA-CT, which adapts responses using learnt rubrics, and (b) GPA-FT, which finetunes models using rubric-guided synthetic data. Automatic and human evaluations confirm that GPA improves group alignment without compromising performance on standard instruction-following benchmarks.
GenTool: Enhancing Tool Generalization in Language Models through Zero-to-One and Weak-to-Strong Simulation
Jie He | Jennifer Neville | Mengting Wan | Longqi Yang | Hui Liu | Xiaofeng Xu | Xia Song | Jeff Z. Pan | Pei Zhou
Findings of the Association for Computational Linguistics: ACL 2025
Large Language Models (LLMs) can enhance their capabilities as AI assistants by integrating external tools, allowing them to access a wider range of information. While recent LLMs are typically fine-tuned with tool usage examples during supervised fine-tuning (SFT), questions remain about their ability to develop robust tool-usage skills and to generalize effectively to unseen queries and tools. In this work, we present GenTool, a novel training framework that prepares LLMs for diverse generalization challenges in tool utilization. Our approach addresses two fundamental dimensions critical for real-world applications: Zero-to-One Generalization, enabling the model to address queries initially lacking a suitable tool by adopting and utilizing one when it becomes available, and Weak-to-Strong Generalization, allowing models to leverage enhanced versions of existing tools to solve queries. To achieve this, we develop synthetic training data simulating these two dimensions of tool usage and introduce a two-stage fine-tuning approach: optimizing tool ranking, then refining tool selection. Through extensive experiments across four generalization scenarios, we demonstrate that our method significantly enhances the tool-usage capabilities of LLMs ranging from 1B to 8B parameters, achieving performance that surpasses GPT-4o. Furthermore, our analysis also provides valuable insights into the challenges LLMs encounter in tool generalization.
Teaching Language Models To Gather Information Proactively
Tenghao Huang | Sihao Chen | Muhao Chen | Jonathan May | Longqi Yang | Mengting Wan | Pei Zhou
Findings of the Association for Computational Linguistics: EMNLP 2025
Large language models (LLMs) are increasingly expected to function as collaborative partners, engaging in back-and-forth dialogue to solve complex, ambiguous problems. However, current LLMs often falter in real-world settings, defaulting to passive responses or narrow clarifications when faced with incomplete or under-specified prompts—falling short of proactively gathering the missing information that is crucial for high-quality solutions. In this work, we introduce a new task paradigm: proactive information gathering, where LLMs must identify gaps in the provided context and strategically elicit implicit user knowledge through targeted questions. To systematically study and train this capability, we design a scalable framework that generates partially specified, real-world tasks, masking key information and simulating authentic ambiguity. Within this setup, our core innovation is a reinforcement finetuning strategy that rewards questions eliciting genuinely new, implicit user information—such as hidden domain expertise or fine-grained requirements—that would otherwise remain unspoken. Experiments demonstrate that our trained Qwen-2.5-7B model significantly outperforms o3-mini by 18% on automatic evaluation metrics. More importantly, human evaluation reveals that clarification questions and final outlines generated by our model are favored by human annotators by 42% and 28% respectively. Together, these results highlight the value of proactive clarification in elevating LLMs from passive text generators to genuinely collaborative thought partners.
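As a purely illustrative sketch (not the paper's implementation), the reward idea can be read as scoring a clarification question by whether a simulated user's answer surfaces information that was masked from the visible prompt; the names simulate_user_answer and masked_facts below are hypothetical placeholders.

def information_gain_reward(question, visible_prompt, masked_facts, simulate_user_answer):
    """Return 1.0 if the simulated user's answer reveals a masked fact that is
    not already present in the visible prompt, else 0.0 (a binary stand-in for
    a graded reward)."""
    answer = simulate_user_answer(question)  # e.g., an LLM role-playing the user
    newly_revealed = [
        fact for fact in masked_facts
        if fact.lower() in answer.lower() and fact.lower() not in visible_prompt.lower()
    ]
    return 1.0 if newly_revealed else 0.0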
2024
Interpretable User Satisfaction Estimation for Conversational Systems with Large Language Models
Ying-Chun Lin | Jennifer Neville | Jack Stokes | Longqi Yang | Tara Safavi | Mengting Wan | Scott Counts | Siddharth Suri | Reid Andersen | Xiaofeng Xu | Deepak Gupta | Sujay Kumar Jauhar | Xia Song | Georg Buscher | Saurabh Tiwary | Brent Hecht | Jaime Teevan
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Accurate and interpretable user satisfaction estimation (USE) is critical for understanding, evaluating, and continuously improving conversational systems. Users express their satisfaction or dissatisfaction with diverse conversational patterns in both general-purpose (ChatGPT and Bing Copilot) and task-oriented (customer service chatbot) conversational systems. Existing approaches based on featurized ML models or text embeddings fall short in extracting generalizable patterns and are hard to interpret. In this work, we show that LLMs can extract interpretable signals of user satisfaction from their natural language utterances more effectively than embedding-based approaches. Moreover, an LLM can be tailored for USE via an iterative prompting framework using supervision from labeled examples. Our proposed method, Supervised Prompting for User satisfaction Rubrics (SPUR), not only has higher accuracy but is more interpretable as it scores user satisfaction via learned rubrics with a detailed breakdown.
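A minimal, hypothetical sketch of rubric-based scoring in the spirit of SPUR follows; the rubric items and the llm_judge callable are illustrative placeholders, not the paper's actual learned rubrics or API.

SATISFACTION_RUBRIC = [
    "User thanks the assistant or says the answer was helpful",
    "User confirms the task was completed as requested",
]
DISSATISFACTION_RUBRIC = [
    "User repeats or rephrases the same request",
    "User points out an error or expresses frustration",
]

def score_satisfaction(user_utterances, llm_judge):
    # llm_judge(rubric_item, utterances) -> 1 if the item applies, else 0
    sat = sum(llm_judge(item, user_utterances) for item in SATISFACTION_RUBRIC)
    dissat = sum(llm_judge(item, user_utterances) for item in DISSATISFACTION_RUBRIC)
    # The per-item verdicts give an interpretable breakdown alongside the score.
    return {"satisfaction": sat, "dissatisfaction": dissat, "net": sat - dissat}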
Pearl: Personalizing Large Language Model Writing Assistants with Generation-Calibrated Retrievers
Sheshera Mysore | Zhuoran Lu | Mengting Wan | Longqi Yang | Bahareh Sarrafzadeh | Steve Menezes | Tina Baghaee | Emmanuel Barajas Gonzalez | Jennifer Neville | Tara Safavi
Proceedings of the 1st Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual (CustomNLP4U)
Powerful large language models have facilitated the development of writing assistants that promise to significantly improve the quality and efficiency of composition and communication. However, a barrier to effective assistance is the lack of personalization in LLM outputs to the author’s communication style, specialized knowledge, and values. In this paper, we address this challenge by proposing Pearl, an LLM writing assistant personalized with a retriever that is trained to be generation-calibrated for personalization. Generation calibration ensures that our retriever selects historic user-authored documents to augment an LLM prompt such that they are likely to help an LLM generation better adhere to a user’s preferences. We propose two key novelties for training such a retriever: (1) A training data selection method that identifies user requests likely to benefit from personalization and documents that provide that benefit; and (2) A scale-calibrating KL-divergence objective that ensures that our retriever scores remain proportional to the downstream generation quality from using the document for personalized generation. In a series of holistic evaluations, we demonstrate the effectiveness of Pearl in generating long-form texts on multiple social media datasets. Finally, we demonstrate how a generation-calibrated retriever can double as a performance predictor – detecting low-quality retrieval, and improving potentially under-performing outputs via revision with LLMs.
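As a rough illustration only (the paper's objective may differ in detail), a scale-calibrating KL term can be sketched as matching the retriever's score distribution over candidate documents to a distribution derived from downstream generation quality; the tensor names and softmax-derived targets below are assumptions.

import torch
import torch.nn.functional as F

def scale_calibrated_kl_loss(retriever_scores, generation_quality, temperature=1.0):
    # Both inputs have shape (num_candidate_docs,). The target distribution is
    # derived from the generation quality observed when each candidate document
    # is used for personalized generation.
    log_p = F.log_softmax(retriever_scores / temperature, dim=-1)  # retriever distribution (log-probs)
    q = F.softmax(generation_quality / temperature, dim=-1)        # quality-derived target distribution
    return F.kl_div(log_p, q, reduction="sum")                     # KL(q || p)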
S3-DST: Structured Open-Domain Dialogue Segmentation and State Tracking in the Era of LLMs
Sarkar Snigdha Sarathi Das | Chirag Shah | Mengting Wan | Jennifer Neville | Longqi Yang | Reid Andersen | Georg Buscher | Tara Safavi
Findings of the Association for Computational Linguistics: ACL 2024
Traditional Dialogue State Tracking (DST) has focused on tracking preferences and intents in conversations centered around specific tasks (e.g. booking services). These conventional systems assume a relatively restricted conversation flow in which each turn gradually offers new information. However, advancements in Large Language Models (LLMs) have ushered in more versatile open-domain chat systems in which extended dialogue sessions encompassing numerous tasks and topics are common—in turn requiring new conversational tracking tools in order to successfully orchestrate such systems. Addressing these challenges, we introduce a novel approach combining dialogue segmentation and state tracking within open-domain dialogues, tailored for zero-shot applications appropriate to a true open-domain dialogue system. Our proposed method S3-DST employs a unique structured prompting technique and *Pre-Analytical Recollection*, a novel grounding mechanism we designed for improving long context tracking. Tested on proprietary anonymized open-domain dialogue datasets as well as publicly available DST and segmentation datasets, S3-DST consistently outperforms the state-of-the-art, showcasing its effectiveness and adaptability for state tracking in the next wave of LLM-based chat systems. We also release S3-DST annotations with GPT-4 on a curated subset of LMSYS-Chat-1M to be used as a testbed to fuel research in this direction.