EnAsCorp1.0: English-Assamese Corpus

Sahinur Rahman Laskar, Abdullah Faiz Ur Rahman Khilji, Partha Pakray, Sivaji Bandyopadhyay


Abstract
Corpus preparation is one of the important and challenging tasks in machine translation, especially in low-resource language scenarios. In a country like India, where multiple languages exist, machine translation attempts to minimize the communication gap among people with different linguistic backgrounds. Although Google Translation covers automatic translation of various languages all over the world, it lags in some languages, including Assamese. In this paper, we have developed EnAsCorp1.0, a corpus for the low-resource English-Assamese pair, in which parallel and monolingual data are collected from various online sources. We have also implemented baseline systems with statistical machine translation and neural machine translation approaches for the same corpus.
Anthology ID:
2020.loresmt-1.9
Volume:
Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages
Month:
December
Year:
2020
Address:
Suzhou, China
Venue:
LoResMT
Publisher:
Association for Computational Linguistics
Pages:
62–68
URL:
https://aclanthology.org/2020.loresmt-1.9
Cite (ACL):
Sahinur Rahman Laskar, Abdullah Faiz Ur Rahman Khilji, Partha Pakray, and Sivaji Bandyopadhyay. 2020. EnAsCorp1.0: English-Assamese Corpus. In Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages, pages 62–68, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
EnAsCorp1.0: English-Assamese Corpus (Laskar et al., LoResMT 2020)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2020.loresmt-1.9.pdf
Data
PMIndia