SENCORPUS: A French-Wolof Parallel Corpus

Elhadji Mamadou Nguer, Alla Lo, Cheikh M. Bamba Dione, Sileye O. Ba, Moussa Lo


Abstract
In this paper, we report efforts towards the acquisition and construction of a bilingual parallel corpus between French and Wolof, a Niger-Congo language belonging to the Northern branch of the Atlantic group. The corpus is constructed as part of the SYSNET3LOc project. It currently contains about 70,000 French-Wolof parallel sentences drawn from various sources in different domains. The paper discusses the data collection procedure, the conversion and alignment of the corpus, and its application as training data for neural machine translation. Using this corpus, we were able to create word embedding models for Wolof with relatively good results. The corpus is currently being used to develop a neural machine translation model that translates French sentences into Wolof.
Anthology ID:
2020.lrec-1.341
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
2803–2811
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.341
Cite (ACL):
Elhadji Mamadou Nguer, Alla Lo, Cheikh M. Bamba Dione, Sileye O. Ba, and Moussa Lo. 2020. SENCORPUS: A French-Wolof Parallel Corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 2803–2811, Marseille, France. European Language Resources Association.
Cite (Informal):
SENCORPUS: A French-Wolof Parallel Corpus (Nguer et al., LREC 2020)
PDF:
https://preview.aclanthology.org/nschneid-patch-1/2020.lrec-1.341.pdf