MedICaT: A Dataset of Medical Images, Captions, and Textual References

Sanjay Subramanian; Lucy Lu Wang; Ben Bogin; Sachin Mehta; Madeleine van Zuylen; Sravanthi Parasa; Sameer Singh; Matt Gardner; Hannaneh Hajishirzi

doi:10.18653/v1/2020.findings-emnlp.191

Abstract

Understanding the relationship between figures and text is key to scientific document understanding. Medical figures in particular are quite complex, often consisting of several subfigures (75% of figures in our dataset), with detailed text describing their content. Previous work studying figures in scientific papers focused on classifying figure content rather than understanding how images relate to the text. To address challenges in figure retrieval and figure-to-text alignment, we introduce MedICaT, a dataset of medical images in context. MedICaT consists of 217K images from 131K open access biomedical papers, and includes captions, inline references for 74% of figures, and manually annotated subfigures and subcaptions for a subset of figures. Using MedICaT, we introduce the task of subfigure to subcaption alignment in compound figures and demonstrate the utility of inline references in image-text matching. Our data and code can be accessed at https://github.com/allenai/medicat.

Anthology ID: 2020.findings-emnlp.191
Volume: Findings of the Association for Computational Linguistics: EMNLP 2020
Month: November
Year: 2020
Address: Online
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 2112–2120
URL: https://aclanthology.org/2020.findings-emnlp.191
DOI: 10.18653/v1/2020.findings-emnlp.191
Cite (ACL): Sanjay Subramanian, Lucy Lu Wang, Ben Bogin, Sachin Mehta, Madeleine van Zuylen, Sravanthi Parasa, Sameer Singh, Matt Gardner, and Hannaneh Hajishirzi. 2020. MedICaT: A Dataset of Medical Images, Captions, and Textual References. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2112–2120, Online. Association for Computational Linguistics.
Cite (Informal): MedICaT: A Dataset of Medical Images, Captions, and Textual References (Subramanian et al., Findings 2020)
PDF: https://preview.aclanthology.org/ingestion-script-update/2020.findings-emnlp.191.pdf
Optional supplementary material: 2020.findings-emnlp.191.OptionalSupplementaryMaterial.zip
Video: https://slideslive.com/38940723
Code: allenai/medicat
Data: MedICaT, ImageNet, S2ORC
