Varsha Kishore
2026
Language Models Don’t Know What You Want: Evaluating Personalization in Deep Research Needs Real Users
Nishant Balepur | Malachi Hamada | Varsha Kishore | Sergey Feldman | Amanpreet Singh | Pao Siangliulue | Joseph Chee Chang | Eunsol Choi | Jordan Lee Boyd-Graber | Aakanksha Naik
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Deep Research (DR) tools (e.g. OpenAI DR) help researchers cope with ballooning publishing counts. Such tools can synthesize scientific papers to answer researchers’ queries, but lack understanding of their users. We change that in MyScholarQA (MySQA), a personalized DR tool that: 1) infers a profile of a user’s research interests; 2) proposes personalized actions for a user’s input query; and 3) writes a multi-section report for the query that follows user-approved actions. We first test MySQA with NLP’s standard protocol: we design a benchmark of synthetic users and LLM judges, where MySQA beats baselines in citation metrics and personalized action-following. However, we suspect this process does not cover all aspects of personalized DR users value, so we interview users in an online version of MySQA to unmask them. We reveal nine nuanced errors of personalized DR undetectable by our LLM judges, and we study qualitative feedback to form lessons for future DR design. In all, we argue for a pillar of personalization that easy-to-use LLM judges can lead NLP to overlook: real progress in personalization is only possible with real users.
2024
Diffusion Guided Language Modeling
Justin Lovelace | Varsha Kishore | Yiwei Chen | Kilian Weinberger
Findings of the Association for Computational Linguistics: ACL 2024
Current language models demonstrate remarkable proficiency in text generation. However, for many applications it is desirable to control attributes of the generated language, such as sentiment or toxicity, ideally tailored towards each specific use case and target audience. For auto-regressive language models, existing guidance methods are prone to decoding errors that cascade during generation and degrade performance. In contrast, text diffusion models can easily be guided with, for example, a simple linear sentiment classifier; however, they suffer from significantly higher perplexity than auto-regressive alternatives. In this paper we use a guided diffusion model to produce a latent proposal that steers an auto-regressive language model to generate text with desired properties. Our model inherits the unmatched fluency of the auto-regressive approach and the plug-and-play flexibility of diffusion. We show that it outperforms previous plug-and-play guidance methods across a wide range of benchmark data sets. Further, controlling a new attribute in our framework is reduced to training a single logistic regression classifier.