UWB@FinTOC-2020 Shared Task: Financial Document Title Detection

Tomáš Hercig, Pavel Kral


Abstract
This paper describes our system for the title detection task of the Financial Document Structure Extraction Shared Task (FinTOC-2020). We rely on the Apache PDFBox library to extract the text and additional information, such as font type and font size, from the financial prospectuses. Our constrained system uses only the provided training data, without any external resources. It is based on a Maximum Entropy classifier with various features, including font type and font size. The system achieves an F1 score of 81% and first place in the French track, and an F1 score of 77% and second place among 5 participating teams in the English track.
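To make the approach in the abstract concrete, here is a minimal sketch of a two-class Maximum Entropy classifier (equivalent to logistic regression) that labels a text line as title or non-title from font information. It is not the authors' implementation. The feature set, the toy lines, the dictionary layout, and the names `featurize`, `train`, and `predict` are all assumptions made for illustration, and the PDFBox extraction step is not shown: each line is assumed to arrive already paired with its font name and font size.

```python
import math

def featurize(line):
    """Map one text line to a numeric feature vector.
    These features are hypothetical, not the ones used in the paper."""
    return [
        1.0,                                              # bias term
        line["font_size"] / 10.0,                         # scaled font size
        1.0 if "Bold" in line["font_name"] else 0.0,      # bold font hint
        1.0 if len(line["text"].split()) <= 8 else 0.0,   # short line hint
    ]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(weights, features):
    return sum(w * x for w, x in zip(weights, features))

def train(lines, labels, epochs=500, lr=0.5):
    """Fit the weights by batch gradient descent on the log loss."""
    xs = [featurize(line) for line in lines]
    weights = [0.0] * len(xs[0])
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(xs, labels):
            p = sigmoid(score(weights, x))
            for j, xj in enumerate(x):
                grad[j] += (p - y) * xj
        weights = [w - lr * g / len(xs) for w, g in zip(weights, grad)]
    return weights

def predict(weights, line):
    """Return True if the line is classified as a title."""
    return sigmoid(score(weights, featurize(line))) >= 0.5

# Toy training data (invented for this sketch): 1 = title, 0 = body text.
TRAIN_LINES = [
    {"text": "Investment Objective", "font_name": "Arial-Bold", "font_size": 14.0},
    {"text": "Risk Factors", "font_name": "Arial-Bold", "font_size": 12.0},
    {"text": "The fund seeks long term capital growth by investing in a "
             "diversified portfolio of equities.",
     "font_name": "Arial", "font_size": 10.0},
    {"text": "Investors should read the prospectus carefully before making "
             "any investment decision at all.",
     "font_name": "Arial", "font_size": 10.0},
]
TRAIN_LABELS = [1, 1, 0, 0]

weights = train(TRAIN_LINES, TRAIN_LABELS)
```

A trained model can then be applied to unseen lines, for example `predict(weights, {"text": "Fees and Expenses", "font_name": "Times-Bold", "font_size": 13.0})`. A real system would need far more training data and a richer feature set than this toy example.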
Anthology ID:
2020.fnp-1.27
Volume:
Proceedings of the 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation
Month:
December
Year:
2020
Address:
Barcelona, Spain (Online)
Venue:
FNP
Publisher:
COLING
Pages:
158–162
URL:
https://aclanthology.org/2020.fnp-1.27
Cite (ACL):
Tomáš Hercig and Pavel Kral. 2020. UWB@FinTOC-2020 Shared Task: Financial Document Title Detection. In Proceedings of the 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation, pages 158–162, Barcelona, Spain (Online). COLING.
Cite (Informal):
UWB@FinTOC-2020 Shared Task: Financial Document Title Detection (Hercig & Kral, FNP 2020)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2020.fnp-1.27.pdf