Abstract
In this paper, we present our contribution to the FinTOC-2022 Shared Task “Financial Document Structure Extraction”. We participated in the three tracks dedicated to English, French and Spanish document processing. Our main contribution consists in treating a financial prospectus as a bundle of documents, i.e., a set of merged documents, each with its own layout and structure. Therefore, Document Layout and Structure Analysis (DLSA) starts with the boundary detection of each document using general layout features. Then, the process is applied inside each single document, taking advantage of its local properties. DLSA is achieved by considering simultaneously the text content, vectorial shapes and images embedded in the native PDF document. For the Title Detection task in English and French, we observed a significant improvement of the F-measures compared with those obtained during our previous participation.
- Anthology ID:
- 2022.fnp-1.15
- Volume:
- Proceedings of the 4th Financial Narrative Processing Workshop @LREC2022
- Month:
- June
- Year:
- 2022
- Address:
- Marseille, France
- Editors:
- Mahmoud El-Haj, Paul Rayson, Nadhem Zmandar
- Venue:
- FNP
- Publisher:
- European Language Resources Association
- Pages:
- 100–104
- URL:
- https://aclanthology.org/2022.fnp-1.15
- Cite (ACL):
- Emmanuel Giguet and Nadine Lucas. 2022. GREYC@FinTOC-2022: Handling Document Layout and Structure in Native PDF Bundle of Documents. In Proceedings of the 4th Financial Narrative Processing Workshop @LREC2022, pages 100–104, Marseille, France. European Language Resources Association.
- Cite (Informal):
- GREYC@FinTOC-2022: Handling Document Layout and Structure in Native PDF Bundle of Documents (Giguet & Lucas, FNP 2022)
- PDF:
- https://preview.aclanthology.org/ingest-bitext-workshop/2022.fnp-1.15.pdf