Abstract
Recent advances in neural computing and word embeddings for semantic processing open many new applications areas which had been left unaddressed so far because of inadequate language understanding capacity. But this new kind of approaches rely even more on training data to be operational. Corpora for financial applications exists, but most of them concern stock market prediction and are in English. To address this need for the French language and regulation oriented applications which require a deeper understanding of the text content, we hereby present “DoRe”, a French and dialectal French Corpus for NLP analytics in Finance, Regulation and Investment. This corpus is composed of: (a) 1769 Annual Reports from 336 companies among the most capitalized companies in: France (Euronext Paris) & Belgium (Euronext Brussels), covering a time frame from 2009 to 2019, and (b) related MetaData containing information for each company about its ISIN code, capitalization and sector. This corpus is designed to be as modular as possible in order to allow for maximum reuse in different tasks pertaining to Economics, Finance and Regulation. After presenting existing resources, we relate the construction of the DoRe corpus and the rationale behind our choices, concluding on the spectrum of possible uses of this new resource for NLP applications.- Anthology ID:
- 2020.lrec-1.275
- Volume:
- Proceedings of the Twelfth Language Resources and Evaluation Conference
- Month:
- May
- Year:
- 2020
- Address:
- Marseille, France
- Editors:
- Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
- Venue:
- LREC
- SIG:
- Publisher:
- European Language Resources Association
- Note:
- Pages:
- 2261–2267
- Language:
- English
- URL:
- https://aclanthology.org/2020.lrec-1.275
- DOI:
- Cite (ACL):
- Corentin Masson and Patrick Paroubek. 2020. NLP Analytics in Finance with DoRe: A French 250M Tokens Corpus of Corporate Annual Reports. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 2261–2267, Marseille, France. European Language Resources Association.
- Cite (Informal):
- NLP Analytics in Finance with DoRe: A French 250M Tokens Corpus of Corporate Annual Reports (Masson & Paroubek, LREC 2020)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-2/2020.lrec-1.275.pdf