GREYC@FinTOC-2022: Handling Document Layout and Structure in Native PDF Bundle of Documents

Emmanuel Giguet, Nadine Lucas


Abstract
In this paper, we present our contribution to the FinTOC-2022 Shared Task “Financial Document Structure Extraction”. We participated in the three tracks dedicated to English, French and Spanish document processing. Our main contribution consists in treating a financial prospectus as a bundle of documents, i.e., a set of merged documents, each with its own layout and structure. Document Layout and Structure Analysis (DLSA) therefore starts by detecting the boundaries of each document using general layout features. The analysis is then applied within each individual document, taking advantage of its local properties. DLSA simultaneously considers the text content, vector shapes and images embedded in the native PDF document. For the Title Detection task in English and French, we observed a significant improvement in F-measure compared with the scores obtained during our previous participation.
Anthology ID:
2022.fnp-1.15
Volume:
Proceedings of the 4th Financial Narrative Processing Workshop @LREC2022
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Mahmoud El-Haj, Paul Rayson, Nadhem Zmandar
Venue:
FNP
Publisher:
European Language Resources Association
Pages:
100–104
URL:
https://aclanthology.org/2022.fnp-1.15
Cite (ACL):
Emmanuel Giguet and Nadine Lucas. 2022. GREYC@FinTOC-2022: Handling Document Layout and Structure in Native PDF Bundle of Documents. In Proceedings of the 4th Financial Narrative Processing Workshop @LREC2022, pages 100–104, Marseille, France. European Language Resources Association.
Cite (Informal):
GREYC@FinTOC-2022: Handling Document Layout and Structure in Native PDF Bundle of Documents (Giguet & Lucas, FNP 2022)
PDF:
https://preview.aclanthology.org/ingest-bitext-workshop/2022.fnp-1.15.pdf