David Sontag
2026
Scaling Collaborative Effort with Agents
Shannon Zejiang Shen | Valerie Chen | Ken Gu | Alexis Ross | Zixian Ma | Jillian Ross | Alex Gu | Chenglei Si | Wayne Chi | Andi Peng | Jocelyn J Shen | Ameet Talwalkar | Tongshuang Wu | David Sontag
Findings of the Association for Computational Linguistics: ACL 2026
Current evaluations of agents remain centered around one-shot task completion, failing to account for the inherently iterative and collaborative nature of many real-world problems, where human goals are often underspecified and evolve. We argue for a shift from building and assessing task completion agents to developing collaborative agents, assessed not only by the quality of their final outputs but by how well they engage with and enhance human effort throughout the problem-solving process. To support this shift, we introduce collaborative effort scaling, a framework that captures how an agent’s utility grows with increasing user involvement. Through case studies and simulated evaluations, we show that state-of-the-art agents often underperform in multi-turn, real-world scenarios, revealing a missing ingredient in agent design: the ability to sustain engagement and scaffold user understanding. Collaborative effort scaling offers a lens for diagnosing agent behavior and guiding development toward more effective interactions.
2024
Learning to Decode Collaboratively with Multiple Language Models
Zejiang Shen | Hunter Lang | Bailin Wang | Yoon Kim | David Sontag
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We propose a method to teach multiple large language models (LLMs) to collaborate by interleaving their generations at the token level. We model the decision of which LLM generates the next token as a latent variable. By optimizing the marginal likelihood of a training set under our latent variable model, the base LLM automatically learns when to generate itself and when to call on one of the “assistant” language models to generate, all without direct supervision. Token-level collaboration during decoding allows for a fusion of each model’s expertise in a manner tailored to the specific task at hand. Our collaborative decoding is especially useful in cross-domain settings where a generalist base LLM learns to invoke domain expert models. On instruction-following, domain-specific QA, and reasoning tasks, we show that the performance of the joint system exceeds that of the individual models. Through qualitative analysis, we show models trained with our method exhibit several interesting collaboration patterns, e.g., template-filling, by visualizing the learned latent decisions.
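The latent-variable objective described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes the per-token choice of model is conditionally independent given the prefix, so that the marginal likelihood of a sequence factorizes into per-token mixtures. The function names and the input layout (`gate_probs`, `token_probs`) are hypothetical and chosen only for illustration.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def token_marginal_log_likelihood(gate_probs, token_probs):
    """Log-probability of one token with the model choice marginalized out.

    gate_probs[z]  : assumed P(z | prefix), the probability that model z generates next
    token_probs[z] : assumed P_z(token | prefix), model z's probability of the observed token
    """
    terms = [
        math.log(g) + math.log(p)
        for g, p in zip(gate_probs, token_probs)
        if g > 0 and p > 0  # zero-probability terms contribute nothing to the sum
    ]
    return log_sum_exp(terms) if terms else float("-inf")


def sequence_marginal_log_likelihood(steps):
    """Sum per-token marginal log-likelihoods over (gate_probs, token_probs) pairs."""
    return sum(token_marginal_log_likelihood(g, p) for g, p in steps)


# Toy example with two models (a base model and one assistant) and two tokens.
steps = [
    ([0.5, 0.5], [0.2, 0.6]),  # token 1: marginal = 0.5*0.2 + 0.5*0.6
    ([1.0, 0.0], [0.5, 0.9]),  # token 2: the gate always picks the base model
]
total = sequence_marginal_log_likelihood(steps)
```

Under these assumptions, training would maximize this quantity with respect to the gate, which never requires labels saying which model should produce each token.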
2022
Large language models are few-shot clinical information extractors
Monica Agrawal | Stefan Hegselmann | Hunter Lang | Yoon Kim | David Sontag
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
A long-running goal of the clinical NLP community is the extraction of important variables trapped in clinical notes. However, roadblocks have included dataset shift from the general domain and a lack of public clinical corpora and annotations. In this work, we show that large language models, such as InstructGPT (Ouyang et al., 2022), perform well at zero- and few-shot information extraction from clinical text despite not being trained specifically for the clinical domain. Whereas text classification and generation performance have already been studied extensively in such models, here we additionally demonstrate how to leverage them to tackle a diverse set of NLP tasks which require more structured outputs, including span identification, token-level sequence classification, and relation extraction. Further, due to the dearth of available data to evaluate these systems, we introduce new datasets for benchmarking few-shot clinical information extraction based on a manual re-annotation of the CASI dataset (Moon et al., 2014) for new tasks. On the clinical extraction tasks we studied, the GPT-3 systems significantly outperform existing zero- and few-shot baselines.
2021
CLIP: A Dataset for Extracting Action Items for Physicians from Hospital Discharge Notes
James Mullenbach | Yada Pruksachatkun | Sean Adler | Jennifer Seale | Jordan Swartz | Greg McKelvey | Hui Dai | Yi Yang | David Sontag
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Continuity of care is crucial to ensuring positive health outcomes for patients discharged from an inpatient hospital setting, and improved information sharing can help. To share information, caregivers write discharge notes containing action items to share with patients and their future caregivers, but these action items are easily lost due to the lengthiness of the documents. In this work, we describe our creation of a dataset of clinical action items annotated over MIMIC-III, the largest publicly available dataset of real clinical notes. This dataset, which we call CLIP, is annotated by physicians and covers 718 documents representing 100K sentences. We describe the task of extracting the action items from these documents as multi-aspect extractive summarization, with each aspect representing a type of action to be taken. We evaluate several machine learning models on this task, and show that the best models exploit in-domain language model pre-training on 59K unannotated documents, and incorporate context from neighboring sentences. We also propose an approach to pre-training data selection that allows us to explore the trade-off between size and domain-specificity of pre-training datasets for this task.
2010
Co-authors
- Michael Collins 2
- Tommi Jaakkola 2
- Yoon Kim 2
- Hunter Lang 2
- Alexander M. Rush 2
- Sean Adler 1
- Monica Agrawal 1
- Valerie Chen 1
- Wayne Chi 1
- Hui Dai 1
- Ken Gu 1
- Alex Gu 1
- Stefan Hegselmann 1
- Terry Koo 1
- Zixian Ma 1
- Greg McKelvey 1
- James Mullenbach 1
- Andi Peng 1
- Yada Pruksachatkun 1
- Alexis Ross 1
- Jillian Ross 1
- Jennifer Seale 1
- Shannon Zejiang Shen 1
- Jocelyn J Shen 1
- Zejiang Shen 1
- Chenglei Si 1
- Jordan Swartz 1
- Ameet Talwalkar 1
- Bailin Wang 1
- Tongshuang Wu 1
- Yi Yang 1