Abstract
This paper describes the acquisition, preprocessing and characteristics of SEDAR, a large-scale English-French parallel corpus for the financial domain. Our extensive experiments on machine translation show that SEDAR is essential to obtain good performance on finance. We observe a large gain in the performance of machine translation systems trained on SEDAR when tested on finance, which makes SEDAR suitable to study domain adaptation for neural machine translation. The first release of the corpus comprises 8.6 million high-quality sentence pairs that are publicly available for research at https://github.com/autorite/sedar-bitext.
- Anthology ID:
- 2020.lrec-1.442
- Volume:
- Proceedings of the Twelfth Language Resources and Evaluation Conference
- Month:
- May
- Year:
- 2020
- Address:
- Marseille, France
- Editors:
- Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
- Venue:
- LREC
- Publisher:
- European Language Resources Association
- Pages:
- 3595–3602
- Language:
- English
- URL:
- https://preview.aclanthology.org/add_missing_videos/2020.lrec-1.442/
- Cite (ACL):
- Abbas Ghaddar and Phillippe Langlais. 2020. SEDAR: a Large Scale French-English Financial Domain Parallel Corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 3595–3602, Marseille, France. European Language Resources Association.
- Cite (Informal):
- SEDAR: a Large Scale French-English Financial Domain Parallel Corpus (Ghaddar & Langlais, LREC 2020)
- PDF:
- https://preview.aclanthology.org/add_missing_videos/2020.lrec-1.442.pdf
- Code:
- autorite/sedar-bitext