Longqi Yang
2026
ProMediate: A Simulation Testbed for Evaluating Proactive Mediation in Multi-Party Negotiation
Ziyi Liu | Bahareh Sarrafzadeh | Pei Zhou | Longqi Yang | Jieyu Zhao | Ashish Sharma
Findings of the Association for Computational Linguistics: ACL 2026
While LLMs increasingly assist individual users, there is a critical need for agents that can proactively manage complex, multi-party collaboration. However, the scarcity of systematic evaluation methods for these group dynamics limits the development of AI capable of effectively supporting teams. Here, we present ProMediate, the first testbed for evaluating proactive AI mediator agents in complex, multi-topic, multi-party negotiations. ProMediate consists of two core components: (i) a simulation environment based on realistic negotiation cases with a plug-and-play proactive AI mediator, capable of flexibly deciding when and how to intervene; and (ii) a socio-cognitive evaluation framework with a new suite of metrics to measure consensus changes, intervention latency, mediator effectiveness, and intelligence. These components establish a systematic framework for assessing the capability of proactive AI agents in multi-party settings. Our results show that a socially intelligent mediator agent outperforms a generic baseline through faster, better-targeted interventions. In the ProMediate-Hard setting, our social mediator increases consensus change by 3.6 percentage points compared to the generic baseline (10.65% vs. 7.01%) while being 77% faster in response (3.71s vs. 15.98s). In conclusion, ProMediate provides a rigorous, theory-grounded testbed to advance the development of proactive, socially intelligent agents.
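The two headline figures in this abstract can be checked with a few lines of arithmetic. The sketch below assumes that 3.71s is the social mediator's response latency and 15.98s the generic baseline's, and that "77% faster" means a 77% reduction in latency; the abstract does not define either explicitly, and the function names are illustrative only.

```python
def percentage_point_gain(treatment_pct: float, baseline_pct: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return treatment_pct - baseline_pct


def relative_reduction(new_value: float, old_value: float) -> float:
    """Fraction by which new_value is smaller than old_value."""
    return (old_value - new_value) / old_value


# Figures quoted in the abstract (ProMediate-Hard setting).
consensus_gain = percentage_point_gain(10.65, 7.01)

# Assumption: 3.71s is the social mediator, 15.98s the generic baseline.
latency_cut = relative_reduction(3.71, 15.98)

print(f"{consensus_gain:.2f} pp, {latency_cut:.0%} lower latency")
```

Under these assumptions the numbers agree with the abstract's rounded "3.6 percentage points" and "77%".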
WildFeedback: Aligning LLMs With In-situ User Interactions And Feedback
Taiwei Shi | Zhuoer Wang | Longqi Yang | Ying-Chun Lin | Zexue He | Mengting Wan | Pei Zhou | Sujay Kumar Jauhar | Sihao Chen | Shan Xia | Hongfei Zhang | Jieyu Zhao | Xiaofeng Xu | Xia Song | Jennifer Neville
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
As large language models (LLMs) continue to advance, aligning these models with human preferences has emerged as a critical challenge. Traditional alignment methods, relying on human- or LLM-annotated datasets, are limited by their resource-intensive nature, inherent subjectivity, misalignment with real-world user preferences, and the risk of feedback loops that amplify model biases. To overcome these limitations, we introduce WildFeedback, a novel framework that leverages in-situ user feedback during conversations with LLMs to create preference datasets automatically. Given a corpus of multi-turn user-LLM conversations, WildFeedback identifies and classifies user feedback to LLM responses between conversation turns. The user feedback is then used to create examples of preferred and dispreferred responses according to users’ preferences. Our experiments demonstrate that LLMs fine-tuned on the WildFeedback dataset exhibit significantly improved alignment with user preferences, as evidenced by both traditional benchmarks and our proposed checklist-guided evaluation. By incorporating in-situ feedback from actual users, WildFeedback addresses the scalability, subjectivity, and bias challenges that plague existing approaches, marking a significant step toward developing LLMs that are more responsive to the diverse and evolving needs of their users.
2025
GenTool: Enhancing Tool Generalization in Language Models through Zero-to-One and Weak-to-Strong Simulation
Jie He | Jennifer Neville | Mengting Wan | Longqi Yang | Hui Liu | Xiaofeng Xu | Xia Song | Jeff Z. Pan | Pei Zhou
Findings of the Association for Computational Linguistics: ACL 2025
Large Language Models (LLMs) can enhance their capabilities as AI assistants by integrating external tools, allowing them to access a wider range of information. While recent LLMs are typically fine-tuned with tool usage examples during supervised fine-tuning (SFT), questions remain about their ability to develop robust tool-usage skills and to generalize effectively to unseen queries and tools. In this work, we present GenTool, a novel training framework that prepares LLMs for diverse generalization challenges in tool utilization. Our approach addresses two fundamental dimensions critical for real-world applications: Zero-to-One Generalization, enabling the model to address queries initially lacking a suitable tool by adopting and utilizing one when it becomes available, and Weak-to-Strong Generalization, allowing models to leverage enhanced versions of existing tools to solve queries. To achieve this, we develop synthetic training data simulating these two dimensions of tool usage and introduce a two-stage fine-tuning approach: optimizing tool ranking, then refining tool selection. Through extensive experiments across four generalization scenarios, we demonstrate that our method significantly enhances the tool-usage capabilities of LLMs ranging from 1B to 8B parameters, achieving performance that surpasses GPT-4o. Our analysis also provides valuable insights into the challenges LLMs encounter in tool generalization.
Group Preference Alignment: Customizing LLM Responses from In-Situ Conversations Only When Needed
Ishani Mondal | Jack W. Stokes | Sujay Kumar Jauhar | Longqi Yang | Mengting Wan | Xiaofeng Xu | Xia Song | Jordan Lee Boyd-Graber | Jennifer Neville
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
LLMs often fail to meet the specialized needs of distinct user groups due to their one-size-fits-all approach, and there is limited understanding of what personalization each group expects. To address this, we propose GPA, a group-aware personalization framework that captures context-specific preference variations and steers LLMs accordingly. Our approach involves: (1) Group-Aware Preference Extraction, which distills divergent preferences from real-world conversation logs into interpretable rubrics, and (2) Tailored Response Generation, using (a) GPA-CT, which adapts responses using learnt rubrics, and (b) GPA-FT, which finetunes models using rubric-guided synthetic data. Automatic and human evaluations confirm that GPA improves group alignment without compromising performance on standard instruction-following benchmarks.
Teaching Language Models To Gather Information Proactively
Tenghao Huang | Sihao Chen | Muhao Chen | Jonathan May | Longqi Yang | Mengting Wan | Pei Zhou
Findings of the Association for Computational Linguistics: EMNLP 2025
Large language models (LLMs) are increasingly expected to function as collaborative partners, engaging in back-and-forth dialogue to solve complex, ambiguous problems. However, current LLMs often falter in real-world settings, defaulting to passive responses or narrow clarifications when faced with incomplete or under-specified prompts, and falling short of proactively gathering the missing information that is crucial for high-quality solutions. In this work, we introduce a new task paradigm: proactive information gathering, where LLMs must identify gaps in the provided context and strategically elicit implicit user knowledge through targeted questions. To systematically study and train this capability, we design a scalable framework that generates partially specified, real-world tasks, masking key information and simulating authentic ambiguity. Within this setup, our core innovation is a reinforcement finetuning strategy that rewards questions that elicit genuinely new, implicit user information (such as hidden domain expertise or fine-grained requirements) that would otherwise remain unspoken. Experiments demonstrate that our trained Qwen-2.5-7B model significantly outperforms o3-mini by 18% on automatic evaluation metrics. More importantly, human evaluation reveals that clarification questions and final outlines generated by our model are favored by human annotators by 42% and 28% respectively. Together, these results highlight the value of proactive clarification in elevating LLMs from passive text generators to genuinely collaborative thought partners.
2024
S3-DST: Structured Open-Domain Dialogue Segmentation and State Tracking in the Era of LLMs
Sarkar Snigdha Sarathi Das | Chirag Shah | Mengting Wan | Jennifer Neville | Longqi Yang | Reid Andersen | Georg Buscher | Tara Safavi
Findings of the Association for Computational Linguistics: ACL 2024
Traditional Dialogue State Tracking (DST) has focused on tracking preferences and intents in conversations centered around specific tasks (e.g. booking services). These conventional systems assume a relatively restricted conversation flow in which each turn gradually offers new information. However, advancements in Large Language Models (LLMs) have ushered in more versatile open-domain chat systems in which extended dialogue sessions encompassing numerous tasks and topics are common, in turn requiring new conversational tracking tools in order to successfully orchestrate such systems. Addressing these challenges, we introduce a novel approach combining dialogue segmentation and state tracking within open-domain dialogues, tailored for zero-shot applications appropriate to a true open-domain dialogue system. Our proposed method S3-DST employs a unique structured prompting technique and *Pre-Analytical Recollection*, a novel grounding mechanism we designed for improving long context tracking. Tested on proprietary anonymized open-domain dialogue datasets as well as publicly available DST and segmentation datasets, S3-DST consistently outperforms the state-of-the-art, showcasing its effectiveness and adaptability for state tracking in the next wave of LLM-based chat systems. We also release S3-DST annotations with GPT-4 on a curated subset of LMSYS-Chat-1M to be used as a testbed to fuel research in this direction.
Interpretable User Satisfaction Estimation for Conversational Systems with Large Language Models
Ying-Chun Lin | Jennifer Neville | Jack Stokes | Longqi Yang | Tara Safavi | Mengting Wan | Scott Counts | Siddharth Suri | Reid Andersen | Xiaofeng Xu | Deepak Gupta | Sujay Kumar Jauhar | Xia Song | Georg Buscher | Saurabh Tiwary | Brent Hecht | Jaime Teevan
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Accurate and interpretable user satisfaction estimation (USE) is critical for understanding, evaluating, and continuously improving conversational systems. Users express their satisfaction or dissatisfaction with diverse conversational patterns in both general-purpose (ChatGPT and Bing Copilot) and task-oriented (customer service chatbot) conversational systems. Existing approaches based on featurized ML models or text embeddings fall short in extracting generalizable patterns and are hard to interpret. In this work, we show that LLMs can extract interpretable signals of user satisfaction from their natural language utterances more effectively than embedding-based approaches. Moreover, an LLM can be tailored for USE via an iterative prompting framework using supervision from labeled examples. Our proposed method, Supervised Prompting for User satisfaction Rubrics (SPUR), not only has higher accuracy but is more interpretable as it scores user satisfaction via learned rubrics with a detailed breakdown.
Pearl: Personalizing Large Language Model Writing Assistants with Generation-Calibrated Retrievers
Sheshera Mysore | Zhuoran Lu | Mengting Wan | Longqi Yang | Bahareh Sarrafzadeh | Steve Menezes | Tina Baghaee | Emmanuel Barajas Gonzalez | Jennifer Neville | Tara Safavi
Proceedings of the 1st Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual (CustomNLP4U)
Powerful large language models have facilitated the development of writing assistants that promise to significantly improve the quality and efficiency of composition and communication. However, a barrier to effective assistance is the lack of personalization in LLM outputs to the author’s communication style, specialized knowledge, and values. In this paper, we address this challenge by proposing Pearl, an LLM writing assistant personalized with a retriever that is trained to be generation-calibrated for personalization. Generation calibration ensures that our retriever selects historic user-authored documents to augment an LLM prompt such that they are likely to help an LLM generation better adhere to a user’s preferences. We propose two key novelties for training such a retriever: (1) a training data selection method that identifies user requests likely to benefit from personalization and documents that provide that benefit; and (2) a scale-calibrating KL-divergence objective that ensures that our retriever scores remain proportional to the downstream generation quality from using the document for personalized generation. In a series of holistic evaluations, we demonstrate the effectiveness of Pearl in generating long-form texts on multiple social media datasets. Finally, we demonstrate how a generation-calibrated retriever can double as a performance predictor, detecting low-quality retrieval and improving potentially under-performing outputs via revision with LLMs.
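To make the idea of a KL-divergence objective that ties retriever scores to downstream generation quality more concrete, here is a minimal sketch. It is not the Pearl objective itself: the abstract does not give the formulation, so the softmax normalization, the temperature parameter, and all function names below are assumptions chosen for illustration. The sketch only captures the distribution-matching idea, where the loss is small when the retriever ranks candidate documents in proportion to how much each one helped generation.

```python
import math


def softmax(scores, temperature=1.0):
    """Turn raw scores into a probability distribution (numerically stable)."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for two discrete distributions."""
    return sum(
        t * math.log((t + eps) / (p + eps))
        for t, p in zip(target, predicted)
        if t > 0
    )


def calibration_loss(retriever_scores, generation_quality):
    """Hypothetical loss: match the retriever's distribution over candidate
    documents to a target distribution derived from generation quality."""
    target = softmax(generation_quality)    # assumed quality signal per document
    predicted = softmax(retriever_scores)   # retriever's scores per document
    return kl_divergence(target, predicted)


# A retriever whose scores track quality incurs (near-)zero loss;
# one that ranks the documents in the opposite order is penalized.
aligned = calibration_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
reversed_order = calibration_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
print(aligned < reversed_order)
```

Because softmax is invariant to adding a constant to every score, this sketch matches scores only up to an offset; a genuinely scale-calibrated objective, as the abstract describes, would need a stronger constraint than this.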
Co-authors
- Mengting Wan 7
- Jennifer Neville 6
- Xia Song 4
- Xiaofeng Xu 4
- Pei Zhou 4
- Sujay Kumar Jauhar 3
- Tara Safavi 3
- Reid Andersen 2
- Georg Buscher 2
- Sihao Chen 2
- Ying-Chun Lin 2
- Bahareh Sarrafzadeh 2
- Jieyu Zhao 2
- Tina Baghaee 1
- Jordan Lee Boyd-Graber 1
- Muhao Chen 1
- Scott Counts 1
- Sarkar Snigdha Sarathi Das 1
- Emmanuel Barajas Gonzalez 1
- Deepak Gupta 1
- Jie He 1
- Zexue He 1
- Brent Hecht 1
- Tenghao Huang 1
- Hui Liu 1
- Ziyi Liu 1
- Zhuoran Lu 1
- Jonathan May 1
- Steve Menezes 1
- Ishani Mondal 1
- Sheshera Mysore 1
- Jeff Z. Pan 1
- Chirag Shah 1
- Ashish Sharma 1
- Taiwei Shi 1
- Jack W. Stokes 1
- Jack Stokes 1
- Siddharth Suri 1
- Jaime Teevan 1
- Saurabh Tiwary 1
- Zhuoer Wang 1
- Shan Xia 1
- Hongfei Zhang 1