Shannon Zejiang Shen
2026
When One LLM Drools, Multi-LLM Collaboration Rules
Shangbin Feng | Wenxuan Ding | Alisa Liu | Zifeng Wang | Weijia Shi | Yike Wang | Shannon Zejiang Shen | Xiaochuang Han | Hunter Lang | Chen-Yu Lee | Tomas Pfister | Yejin Choi | Yulia Tsvetkov
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
This position paper argues that in many realistic (i.e., complex, contextualized, subjective) scenarios, one LLM is not enough to produce a reliable output. We challenge the status quo of relying solely on a single general-purpose LLM and argue for multi-LLM collaboration to better represent the extensive diversity of data, skills, and people. We first posit that a single LLM underrepresents real-world data distributions, heterogeneous skills, and pluralistic populations, and that such representation gaps cannot be trivially patched by further training a single LLM. We then organize existing multi-LLM collaboration methods into a hierarchy, based on the level of access and information exchange, ranging from API-level, text-level, logit-level, to weight-level collaboration. Based on these methods, we highlight how multi-LLM collaboration addresses challenges that a single LLM struggles with, such as reliability, democratization, and pluralism. Finally, we identify the limitations of existing multi-LLM methods and motivate future work. We envision multi-LLM collaboration as an essential path toward compositional intelligence and collaborative AI development.
GovScape: A Public Multimodal Search System for 70 Million Pages of Government PDFs
Ying-Hsiang Huang | Claire Gong | Shreya Shaji | Alison R Yan | Leslie Harka | Albert Du | Anjali Shubha Gopal | Samuel J Klein | Shannon Zejiang Shen | Mark E. Phillips | Trevor Owens | Kyle Deeds | Benjamin Charles Germain Lee
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Efforts over the past three decades have produced web archives containing billions of webpage snapshots and petabytes of data. The End of Term Web Archive alone contains millions of PDFs produced by the federal government. While preservation with web archives has been successful, significant challenges for access and discoverability remain. In this paper, we introduce GovScape, a public search system that supports multimodal searches across 10,015,993 federal government PDFs from the 2020 End of Term crawl (70,958,487 total PDF pages) – to our knowledge, all renderable PDFs in the 2020 crawl that are 50 pages or under. GovScape supports four primary forms of search: in addition to providing (1) filter conditions over metadata facets including domain and crawl date and (2) exact text search against the PDF text, we provide (3) semantic text search and (4) visual search against the PDFs across individual pages, enabling users to structure queries such as “redacted documents” or “pie charts.” We detail GovScape’s search affordances, embedding pipeline, system architecture, and open source codebase. Significantly, the total estimated compute cost for GovScape’s pre-processing pipeline for 10 million PDFs was approximately $1,500, equivalent to 47,000 PDF pages per dollar spent on compute, demonstrating the potential for immediate scalability. We evaluate GovScape by (1) analyzing 1,679 search queries and (2) benchmarking vector and keyword index efficiency using these queries. GovScape can be found at https://www.govscape.net.
2025
SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature
David Wadden | Kejian Shi | Jacob Morrison | Alan Li | Aakanksha Naik | Shruti Singh | Nitzan Barzilay | Kyle Lo | Tom Hope | Luca Soldaini | Shannon Zejiang Shen | Doug Downey | Hannaneh Hajishirzi | Arman Cohan
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following instances for training and evaluation, covering 54 tasks. These tasks span five core scientific literature understanding capabilities: information extraction, summarization, question answering, claim verification, and classification. SciRIFF is unique in being the only entirely expert-written, high-quality instruction-following dataset designed for extracting and synthesizing information from research literature across diverse scientific fields. It features complex instructions with long input contexts, detailed task descriptions, and structured outputs. To demonstrate its utility, we finetune a series of large language models (LLMs) using a mix of general-domain and SciRIFF instructions. On nine out-of-distribution held-out tasks (referred to as SciRIFF-Eval), LLMs finetuned on SciRIFF achieve a 70.6% average improvement over our baselines trained only on general-domain instructions. SciRIFF facilitates the development and evaluation of LLMs to help researchers navigate the rapidly growing body of scientific literature.
CourtReasoner: Can LLM Agents Reason Like Judges?
Sophia Simeng Han | Yoshiki Takashima | Shannon Zejiang Shen | Chen Liu | Yixin Liu | Roque K. Thuo | Sonia Knowlton | Ruzica Piskac | Scott J Shapiro | Arman Cohan
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
LLMs are increasingly applied in the legal domain in tasks such as summarizing legal texts and providing basic legal advice. Yet their capacity to draft full judicial analyses, such as generating entire judicial reasoning sections in U.S. court decisions, remains under-explored. Given the continued adoption of LLMs and the significance of law to society at large, measuring LLMs’ legal reasoning capabilities is a pressing task. We propose CourtReasoner, a novel expert-annotated judicial reasoning benchmark for evaluating LLM agents’ capabilities in complex legal reasoning. Sourcing U.S. court opinions, we construct benchmarks that measure LLMs’ ability to construct goal-oriented legal reasoning. CourtReasoner measures an agent’s ability to argue both ways in a legal dispute, rather than simple Q/A. Our results show that more than 60% of frontier model outputs contain invalid arguments and more than 53% contain irrelevant citations when conducting complex legal reasoning. We also introduce a meta-evaluation benchmark to provide insights into the capabilities of LLMs as evaluators of legal reasoning. We will release our data, code, and full annotation guidelines publicly for future research.
Co-authors
- Arman Cohan 2
- Nitzan Barzilay 1
- Yejin Choi 1
- Kyle Deeds 1
- Wenxuan Ding 1
- Doug Downey 1
- Albert Du 1
- Shangbin Feng 1
- Claire Gong 1
- Anjali Shubha Gopal 1
- Hannaneh Hajishirzi 1
- Sophia Simeng Han 1
- Xiaochuang Han 1
- Leslie Harka 1
- Tom Hope 1
- Ying-Hsiang Huang 1
- Samuel J Klein 1
- Sonia Knowlton 1
- Hunter Lang 1
- Benjamin Charles Germain Lee 1
- Chen-Yu Lee 1
- Alan Li 1
- Alisa Liu 1
- Chen Liu 1
- Yixin Liu 1
- Kyle Lo 1
- Jacob Morrison 1
- Aakanksha Naik 1
- Trevor Owens 1
- Tomas Pfister 1
- Mark E. Phillips 1
- Ruzica Piskac 1
- Shreya Shaji 1
- Scott J Shapiro 1
- Kejian Shi 1
- Weijia Shi 1
- Shruti Singh 1
- Luca Soldaini 1
- Yoshiki Takashima 1
- Roque K. Thuo 1
- Yulia Tsvetkov 1
- David Wadden 1
- Yike Wang 1
- Zifeng Wang 1
- Alison R Yan 1