New language resources for the Pashto language
Djamel Mostefa, Khalid Choukri, Sylvie Brunessaux, Karim Boudahmane
Abstract
This paper reports on the development of new language resources for the Pashto language, a very low-resource language spoken in Afghanistan and Pakistan. In the scope of a multilingual data collection project, three large corpora are collected for Pashto. First, a monolingual text corpus of 100 million words is produced. Second, a 100-hour speech database is recorded and manually transcribed. Finally, a bilingual Pashto-French parallel corpus of around 2 million is produced by translating Pashto texts into French. These resources will be used to develop Human Language Technology systems for Pashto, with a special focus on Machine Translation.
- Anthology ID:
- L12-1490
- Volume:
- Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
- Month:
- May
- Year:
- 2012
- Address:
- Istanbul, Turkey
- Editors:
- Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
- Venue:
- LREC
- Publisher:
- European Language Resources Association (ELRA)
- Pages:
- 2917–2922
- URL:
- http://www.lrec-conf.org/proceedings/lrec2012/pdf/824_Paper.pdf
- Cite (ACL):
- Djamel Mostefa, Khalid Choukri, Sylvie Brunessaux, and Karim Boudahmane. 2012. New language resources for the Pashto language. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 2917–2922, Istanbul, Turkey. European Language Resources Association (ELRA).
- Cite (Informal):
- New language resources for the Pashto language (Mostefa et al., LREC 2012)