Sylvie Brunessaux


2012

New language resources for the Pashto language
Djamel Mostefa | Khalid Choukri | Sylvie Brunessaux | Karim Boudahmane
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper reports on the development of new language resources for the Pashto language, a very low-resource language spoken in Afghanistan and Pakistan. As part of a multilingual data collection project, three large corpora are collected for Pashto. First, a monolingual text corpus of 100 million words is produced. Second, a 100-hour speech database is recorded and manually transcribed. Finally, a bilingual Pashto-French parallel corpus of around 2 million is produced by translating Pashto texts into French. These resources will be used to develop Human Language Technology systems for Pashto, with a special focus on Machine Translation.