EnAsCorp1.0: English-Assamese Corpus
Sahinur Rahman Laskar, Abdullah Faiz Ur Rahman Khilji, Partha Pakray, Sivaji Bandyopadhyay
Abstract
Corpus preparation is one of the important and challenging tasks in machine translation, especially in low-resource language scenarios. In a country like India, where multiple languages exist, machine translation attempts to minimize the communication gap among people with different linguistic backgrounds. Although Google Translation covers automatic translation of various languages all over the world, it lags in some languages, including Assamese. In this paper, we have developed EnAsCorp1.0, a corpus for the low-resource English-Assamese pair, in which parallel and monolingual data are collected from various online sources. We have also implemented baseline systems with statistical machine translation and neural machine translation approaches for the same corpus.
- Anthology ID:
- 2020.loresmt-1.9
- Volume:
- Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages
- Month:
- December
- Year:
- 2020
- Address:
- Suzhou, China
- Venue:
- LoResMT
- Publisher:
- Association for Computational Linguistics
- Pages:
- 62–68
- URL:
- https://aclanthology.org/2020.loresmt-1.9
- Cite (ACL):
- Sahinur Rahman Laskar, Abdullah Faiz Ur Rahman Khilji, Partha Pakray, and Sivaji Bandyopadhyay. 2020. EnAsCorp1.0: English-Assamese Corpus. In Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages, pages 62–68, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal):
- EnAsCorp1.0: English-Assamese Corpus (Laskar et al., LoResMT 2020)
- PDF:
- https://preview.aclanthology.org/ingestion-script-update/2020.loresmt-1.9.pdf
- Data
- PMIndia