Debasish Kumar Mallick
2020
ODIANLP’s Participation in WAT2020
Shantipriya Parida | Petr Motlicek | Amulya Ratna Dash | Satya Ranjan Dash | Debasish Kumar Mallick | Satya Prakash Biswal | Priyanka Pattnaik | Biranchi Narayan Nayak | Ondřej Bojar
Proceedings of the 7th Workshop on Asian Translation
This paper describes the ODIANLP submission to WAT 2020. We participated in the English-Hindi Multimodal task and the Indic task. We used the state-of-the-art Transformer model for the translation task and InceptionResNetV2 for the Hindi Image Captioning task. Our submission ranked first in its track of the English→Hindi Multimodal task and in the Odia↔English translation tasks. Our submissions also performed well in the Indic Multilingual tasks.
OdiEnCorp 2.0: Odia-English Parallel Corpus for Machine Translation
Shantipriya Parida | Satya Ranjan Dash | Ondřej Bojar | Petr Motlicek | Priyanka Pattnaik | Debasish Kumar Mallick
Proceedings of the WILDRE5 – 5th Workshop on Indian Language Data: Resources and Evaluation
The preparation of parallel corpora is a challenging task, particularly for languages that are under-represented in the digital world. In a multilingual country like India, the need for such parallel corpora is pressing for several low-resource languages. In this work, we provide an extended English-Odia parallel corpus, OdiEnCorp 2.0, aimed particularly at Neural Machine Translation (NMT) systems that translate English↔Odia. OdiEnCorp 2.0 includes existing English-Odia corpora, which we extended using several other methods of data acquisition: scraping parallel data from many websites, including Odia Wikipedia, and applying optical character recognition (OCR) to extract parallel data from scanned images. Our OCR-based data extraction approach for building a parallel corpus is suitable for other low-resource languages that lack online content. The resulting OdiEnCorp 2.0 contains 98,302 sentences, with 1.69 million English tokens and 1.47 million Odia tokens. To the best of our knowledge, OdiEnCorp 2.0 is the largest Odia-English parallel corpus covering different domains, and it is freely available for non-commercial and research purposes.
Co-authors
- Shantipriya Parida 2
- Petr Motlicek 2
- Satya Ranjan Dash 2
- Priyanka Pattnaik 2
- Ondřej Bojar 2