Parsing Electronic Theses and Dissertations Using Object Detection

Aman Ahuja; Alan Devera; Edward Alan Fox

doi:10.18653/v1/2022.wiesp-1.14

Abstract

Electronic theses and dissertations (ETDs) contain valuable knowledge that can be useful for a wide range of purposes. To effectively utilize the knowledge contained in ETDs for downstream tasks such as search and retrieval, question-answering, and summarization, the data first needs to be parsed and stored in a format such as XML. However, since most of the ETDs available on the web are PDF documents, parsing them to make their data useful for downstream tasks is a challenge. In this work, we propose a dataset and a framework to help with parsing long scholarly documents such as ETDs. We take an object detection approach to document parsing. We first introduce a set of objects that are important elements of an ETD, along with a new dataset, ETD-OD, that consists of over 25K page images originating from 200 ETDs, with bounding boxes around each of the objects. We also propose a framework that utilizes this dataset for converting ETDs to XML, which can further be used for ETD-related downstream tasks. Our code and pre-trained models are available at: https://github.com/Opening-ETDs/ETD-OD.
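As a rough illustration of the last step described in the abstract (turning detected page objects into XML), the following is a minimal sketch and not the authors' implementation. It assumes a detector has already produced labeled bounding boxes with extracted text for each page. The `Detection` class, the `detections_to_xml` function, the element names, and the top-to-bottom reading-order rule are all hypothetical choices made for this example.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Detection:
    """One detected page object (hypothetical structure for this sketch)."""
    page: int    # 1-based page number
    label: str   # object class, e.g. "title" or "paragraph" (illustrative)
    box: tuple   # (x_min, y_min, x_max, y_max) in page-image pixels
    text: str    # text extracted from inside the box


def detections_to_xml(detections):
    """Serialize detections to an XML string, grouped by page.

    Assumed reading order: by page, then top-to-bottom, then left-to-right.
    """
    root = ET.Element("etd")
    pages = {}
    ordered = sorted(detections, key=lambda d: (d.page, d.box[1], d.box[0]))
    for det in ordered:
        page_el = pages.get(det.page)
        if page_el is None:
            # Create one <page> element per page, in page order.
            page_el = ET.SubElement(root, "page", number=str(det.page))
            pages[det.page] = page_el
        bbox = ",".join(str(v) for v in det.box)
        obj_el = ET.SubElement(page_el, det.label, bbox=bbox)
        obj_el.text = det.text
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = [
        Detection(1, "paragraph", (50, 200, 500, 400), "Body text."),
        Detection(1, "title", (50, 40, 500, 90), "A Thesis"),
    ]
    print(detections_to_xml(sample))
```

Sorting by the top edge of each box is a simplification that only suits single-column pages; multi-column layouts would need a different ordering rule.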

Anthology ID: 2022.wiesp-1.14
Volume: Proceedings of the first Workshop on Information Extraction from Scientific Publications
Month: November
Year: 2022
Address: Online
Editors: Tirthankar Ghosal, Sergi Blanco-Cuaresma, Alberto Accomazzi, Robert M. Patton, Felix Grezes, Thomas Allen
Venue: WIESP
Publisher: Association for Computational Linguistics
Pages: 121–130
URL: https://preview.aclanthology.org/fix-sig-urls/2022.wiesp-1.14/
DOI: 10.18653/v1/2022.wiesp-1.14
Cite (ACL): Aman Ahuja, Alan Devera, and Edward Alan Fox. 2022. Parsing Electronic Theses and Dissertations Using Object Detection. In Proceedings of the first Workshop on Information Extraction from Scientific Publications, pages 121–130, Online. Association for Computational Linguistics.
Cite (Informal): Parsing Electronic Theses and Dissertations Using Object Detection (Ahuja et al., WIESP 2022)
PDF: https://preview.aclanthology.org/fix-sig-urls/2022.wiesp-1.14.pdf
