EDGAR-CORPUS: Billions of Tokens Make The World Go Round
Lefteris Loukas, Manos Fergadiotis, Ion Androutsopoulos, Prodromos Malakasiotis
Abstract
We release EDGAR-CORPUS, a novel corpus comprising annual reports from all the publicly traded companies in the US spanning a period of more than 25 years. To the best of our knowledge, EDGAR-CORPUS is the largest financial NLP corpus available to date. All the reports are downloaded, split into their corresponding items (sections), and provided in a clean, easy-to-use JSON format. We use EDGAR-CORPUS to train and release EDGAR-W2V, which are WORD2VEC embeddings for the financial domain. We employ these embeddings in a battery of financial NLP tasks and showcase their superiority over generic GloVe embeddings and other existing financial word embeddings. We also open-source EDGAR-CRAWLER, a toolkit that facilitates downloading and extracting future annual reports.
- Anthology ID:
- 2021.econlp-1.2
- Volume:
- Proceedings of the Third Workshop on Economics and Natural Language Processing
- Month:
- November
- Year:
- 2021
- Address:
- Punta Cana, Dominican Republic
- Editors:
- Udo Hahn, Veronique Hoste, Amanda Stent
- Venue:
- ECONLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 13–18
- URL:
- https://aclanthology.org/2021.econlp-1.2
- DOI:
- 10.18653/v1/2021.econlp-1.2
- Cite (ACL):
- Lefteris Loukas, Manos Fergadiotis, Ion Androutsopoulos, and Prodromos Malakasiotis. 2021. EDGAR-CORPUS: Billions of Tokens Make The World Go Round. In Proceedings of the Third Workshop on Economics and Natural Language Processing, pages 13–18, Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Cite (Informal):
- EDGAR-CORPUS: Billions of Tokens Make The World Go Round (Loukas et al., ECONLP 2021)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-1/2021.econlp-1.2.pdf
- Data
- EDGAR-CORPUS