2025
pdf
bib
abs
Beyond Text: Characterizing Domain Expert Needs in Document Research
Sireesh Gururaja
|
Nupoor Gandhi
|
Jeremiah Milbauer
|
Emma Strubell
Findings of the Association for Computational Linguistics: ACL 2025
Working with documents is a key part of almost any knowledge work, from contextualizing research in a literature review to reviewing legal precedent. Recently, as their capabilities have expanded, primarily text-based NLP systems have often been billed as able to assist or even automate this kind of work. But to what extent are these systems able to model these tasks as experts conceptualize and perform them now? In this study, we interview sixteen domain experts across two domains to understand their processes of document research, and compare it to the current state of NLP systems. We find that our participants processes are idiosyncratic, iterative, and rely extensively on the social context of a document in addition its content, and that approaches in NLP and adjacent fields that explicitly center the document as an object, rather than as merely a container for text, tend to better reflect our participants’ priorities. We call on the NLP community to more carefully consider the role of the document in building useful tools that are accessible, personalizable, iterative, and socially aware.
pdf
bib
abs
Collage: Decomposable Rapid Prototyping for Co-Designed Information Extraction on Scientific PDFs
Sireesh Gururaja
|
Yueheng Zhang
|
Guannan Tang
|
Tianhao Zhang
|
Kevin Murphy
|
Yu-Tsen Yi
|
Junwon Seo
|
Anthony Rollett
|
Emma Strubell
Proceedings of the Fifth Workshop on Scholarly Document Processing (SDP 2025)
Recent years in NLP have seen the continued development of domain-specific information extraction tools for scientific documents, alongside the release of increasingly multimodal pretrained language models. While applying and evaluating these new, general-purpose language model systems in specialized domains has never been easier, it remains difficult to compare them with models developed specifically for those domains, which tend to accept a narrower range of input formats, and are difficult to evaluate in the context of the original documents. Meanwhile, the general-purpose systems are often black-box and give little insight into preprocessing (like conversion to plain text or markdown) that can have significant downstream impact on their results.In this work, we present Collage, a tool intended to facilitate the co-design of information extraction systems on scientific PDFs between NLP developers and scientists by facilitating the rapid prototyping, visualization, and comparison of different information extraction models on scientific PDFs, regardless of their input modality. For scientists, Collage provides side-by-side visualization and comparison of multiple models of different input and output modalities in the context of the PDF content they are applied to; for developers, Collage allows the rapid deployment of new models by abstracting away PDF preprocessing and visualization into easily extensible software interfaces. Further, we enable both developers and scientists to inspect, debug, and better understand modeling pipelines by providing granular views of intermediate states of processing. We demonstrate our system in the context of information extraction to assist with literature review in materials science.
2023
pdf
bib
abs
Linguistic representations for fewer-shot relation extraction across domains
Sireesh Gururaja
|
Ritam Dutt
|
Tinglong Liao
|
Carolyn Rosé
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent work has demonstrated the positive impact of incorporating linguistic representations as additional context and scaffolds on the in-domain performance of several NLP tasks. We extend this work by exploring the impact of linguistic representations on cross-domain performance in a few-shot transfer setting. An important question is whether linguistic representations enhance generalizability by providing features that function as cross-domain pivots. We focus on the task of relation extraction on three datasets of procedural text in two domains, cooking and materials science. Our approach augments a popular transformer-based architecture by alternately incorporating syntactic and semantic graphs constructed by freely available off-the-shelf tools. We examine their utility for enhancing generalization, and investigate whether earlier findings, e.g. that semantic representations can be more helpful than syntactic ones, extend to relation extraction in multiple domains. We find that while the inclusion of these graphs results in significantly higher performance in few-shot transfer, both types of graph exhibit roughly equivalent utility.
pdf
bib
abs
To Build Our Future, We Must Know Our Past: Contextualizing Paradigm Shifts in Natural Language Processing
Sireesh Gururaja
|
Amanda Bertsch
|
Clara Na
|
David Widder
|
Emma Strubell
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
NLP is in a period of disruptive change that is impacting our methodologies, funding sources, and public perception. In this work, we seek to understand how to shape our future by better understanding our past. We study factors that shape NLP as a field, including culture, incentives, and infrastructure by conducting long-form interviews with 26 NLP researchers of varying seniority, research area, institution, and social identity. Our interviewees identify cyclical patterns in the field, as well as new shifts without historical parallel, including changes in benchmark culture and software infrastructure. We complement this discussion with quantitative analysis of citation, authorship, and language use in the ACL Anthology over time. We conclude by discussing shared visions, concerns, and hopes for the future of NLP. We hope that this study of our field’s past and present can prompt informed discussion of our community’s implicit norms and more deliberate action to consciously shape the future.
2022
pdf
bib
abs
R3 : Refined Retriever-Reader pipeline for Multidoc2dial
Srijan Bansal
|
Suraj Tripathi
|
Sumit Agarwal
|
Sireesh Gururaja
|
Aditya Srikanth Veerubhotla
|
Ritam Dutt
|
Teruko Mitamura
|
Eric Nyberg
Proceedings of the Second DialDoc Workshop on Document-grounded Dialogue and Conversational Question Answering
In this paper, we present our submission to the DialDoc shared task based on the MultiDoc2Dial dataset. MultiDoc2Dial is a conversational question answering dataset that grounds dialogues in multiple documents. The task involves grounding a user’s query in a document followed by generating an appropriate response. We propose several improvements over the baseline’s retriever-reader architecture to aid in modeling goal-oriented dialogues grounded in multiple documents. Our proposed approach employs sparse representations for passage retrieval, a passage re-ranker, the fusion-in-decoder architecture for generation, and a curriculum learning training paradigm. Our approach shows a 12 point improvement in BLEU score compared to the baseline RAG model.