Collage: Decomposable Rapid Prototyping for Co-Designed Information Extraction on Scientific PDFs

Sireesh Gururaja, Yueheng Zhang, Guannan Tang, Tianhao Zhang, Kevin Murphy, Yu-Tsen Yi, Junwon Seo, Anthony Rollett, Emma Strubell


Abstract
Recent years in NLP have seen the continued development of domain-specific information extraction tools for scientific documents, alongside the release of increasingly multimodal pretrained language models. While applying and evaluating these new, general-purpose language model systems in specialized domains has never been easier, it remains difficult to compare them with models developed specifically for those domains, which tend to accept a narrower range of input formats, and are difficult to evaluate in the context of the original documents. Meanwhile, the general-purpose systems are often black-box and give little insight into preprocessing (like conversion to plain text or markdown) that can have significant downstream impact on their results. In this work, we present Collage, a tool intended to facilitate the co-design of information extraction systems on scientific PDFs between NLP developers and scientists by facilitating the rapid prototyping, visualization, and comparison of different information extraction models on scientific PDFs, regardless of their input modality. For scientists, Collage provides side-by-side visualization and comparison of multiple models of different input and output modalities in the context of the PDF content they are applied to; for developers, Collage allows the rapid deployment of new models by abstracting away PDF preprocessing and visualization into easily extensible software interfaces. Further, we enable both developers and scientists to inspect, debug, and better understand modeling pipelines by providing granular views of intermediate states of processing. We demonstrate our system in the context of information extraction to assist with literature review in materials science.
Anthology ID:
2025.sdp-1.7
Volume:
Proceedings of the Fifth Workshop on Scholarly Document Processing (SDP 2025)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Tirthankar Ghosal, Philipp Mayr, Amanpreet Singh, Aakanksha Naik, Georg Rehm, Dayne Freitag, Dan Li, Sonja Schimmler, Anita De Waard
Venues:
sdp | WS
Publisher:
Association for Computational Linguistics
Pages:
72–82
URL:
https://preview.aclanthology.org/display_plenaries/2025.sdp-1.7/
Cite (ACL):
Sireesh Gururaja, Yueheng Zhang, Guannan Tang, Tianhao Zhang, Kevin Murphy, Yu-Tsen Yi, Junwon Seo, Anthony Rollett, and Emma Strubell. 2025. Collage: Decomposable Rapid Prototyping for Co-Designed Information Extraction on Scientific PDFs. In Proceedings of the Fifth Workshop on Scholarly Document Processing (SDP 2025), pages 72–82, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Collage: Decomposable Rapid Prototyping for Co-Designed Information Extraction on Scientific PDFs (Gururaja et al., sdp 2025)
PDF:
https://preview.aclanthology.org/display_plenaries/2025.sdp-1.7.pdf