swapUNIBA@FinTOC2022: Fine-tuning Pre-trained Document Image Analysis Model for Title Detection on the Financial Domain
Pierluigi Cassotti, Cataldo Musto, Marco DeGemmis, Georgios Lekkas, Giovanni Semeraro
Abstract
In this paper, we present the results of the system we submitted to the FinTOC 2022 task. We address the task in two stages: first, we detect titles using Document Image Analysis; then, we train a supervised model to predict each title's hierarchical level. For Document Image Analysis, we use a Faster R-CNN pre-trained on the PubLayNet dataset and fine-tuned on the FinTOC 2022 training set. We extract orthographic and layout features from the detected titles and use them to train a Random Forest model that predicts the title level. The proposed system ranked #1 on both the Title Detection and the Table of Contents extraction subtasks for Spanish, and #3 on both subtasks for English and French.
- Anthology ID:
- 2022.fnp-1.14
- Volume:
- Proceedings of the 4th Financial Narrative Processing Workshop @LREC2022
- Month:
- June
- Year:
- 2022
- Address:
- Marseille, France
- Editors:
- Mahmoud El-Haj, Paul Rayson, Nadhem Zmandar
- Venue:
- FNP
- Publisher:
- European Language Resources Association
- Pages:
- 95–99
- URL:
- https://aclanthology.org/2022.fnp-1.14
- Cite (ACL):
- Pierluigi Cassotti, Cataldo Musto, Marco DeGemmis, Georgios Lekkas, and Giovanni Semeraro. 2022. swapUNIBA@FinTOC2022: Fine-tuning Pre-trained Document Image Analysis Model for Title Detection on the Financial Domain. In Proceedings of the 4th Financial Narrative Processing Workshop @LREC2022, pages 95–99, Marseille, France. European Language Resources Association.
- Cite (Informal):
- swapUNIBA@FinTOC2022: Fine-tuning Pre-trained Document Image Analysis Model for Title Detection on the Financial Domain (Cassotti et al., FNP 2022)
- PDF:
- https://preview.aclanthology.org/naacl24-info/2022.fnp-1.14.pdf
- Data
- PubLayNet
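
The second stage described in the abstract, turning each detected title into orthographic and layout features for a level classifier, can be illustrated with a minimal sketch. The abstract does not list the actual features, so everything here is an assumption made for illustration: the `DetectedTitle` structure, the `title_features` function, the numbering pattern, and the six features themselves are hypothetical and are not taken from the paper.

```python
import re
from dataclasses import dataclass


@dataclass
class DetectedTitle:
    """A title region from the detection stage (hypothetical structure)."""
    text: str
    x0: float           # left edge of the bounding box
    y0: float           # top edge of the bounding box
    x1: float           # right edge of the bounding box
    y1: float           # bottom edge of the bounding box
    page_width: float
    page_height: float


# Assumed pattern for leading section numbering such as "1", "2.3" or "4.1.2."
NUMBERING = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s")


def title_features(t: DetectedTitle) -> dict:
    """Build illustrative orthographic and layout features for one title."""
    letters = [c for c in t.text if c.isalpha()]
    upper_ratio = (
        sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    )
    match = NUMBERING.match(t.text)
    # "2.3" has depth 2, "4.1.2" has depth 3; 0 means no leading numbering.
    numbering_depth = match.group(1).count(".") + 1 if match else 0
    return {
        # Orthographic features (assumed, not from the paper)
        "upper_ratio": upper_ratio,
        "numbering_depth": numbering_depth,
        "n_chars": len(t.text),
        # Layout features, normalised by page size (assumed, not from the paper)
        "rel_left": t.x0 / t.page_width,
        "rel_top": t.y0 / t.page_height,
        "rel_height": (t.y1 - t.y0) / t.page_height,
    }


if __name__ == "__main__":
    title = DetectedTitle("2.3 Risk Factors", 50, 100, 300, 120, 500, 1000)
    print(title_features(title))
```

Feature dictionaries of this kind could then be converted to fixed-order vectors and passed to a Random Forest implementation, for example scikit-learn's `RandomForestClassifier`, with the annotated title level as the target. Normalising positions by page size is one plausible design choice that keeps layout features comparable across documents with different page dimensions.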