EM Corpus: a comparable corpus for a less-resourced language pair Manipuri-English

Rudali Huidrom, Yves Lepage, Khogendra Khomdram


Abstract
In this paper, we introduce a sentence-level comparable text corpus crawled and created for the less-resourced language pair Manipuri (mni) and English (eng). Our monolingual corpora comprise 1.88 million Manipuri sentences and 1.45 million English sentences, and our parallel corpus comprises 124,975 Manipuri-English sentence pairs. These data were crawled and collected over a year from August 2020 to March 2021 from a local newspaper website called ‘The Sangai Express.’ The resources reported in this paper are made available to help the low-resourced languages community with MT/NLP tasks.
Anthology ID:
2021.bucc-1.8
Volume:
Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC 2021)
Month:
September
Year:
2021
Address:
Online (Virtual Mode)
Venue:
BUCC
Publisher:
INCOMA Ltd.
Pages:
60–67
URL:
https://aclanthology.org/2021.bucc-1.8
Cite (ACL):
Rudali Huidrom, Yves Lepage, and Khogendra Khomdram. 2021. EM Corpus: a comparable corpus for a less-resourced language pair Manipuri-English. In Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC 2021), pages 60–67, Online (Virtual Mode). INCOMA Ltd.
Cite (Informal):
EM Corpus: a comparable corpus for a less-resourced language pair Manipuri-English (Huidrom et al., BUCC 2021)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2021.bucc-1.8.pdf
Data
PMIndia