2022
Hausa Visual Genome: A Dataset for Multi-Modal English to Hausa Machine Translation
Idris Abdulmumin | Satya Ranjan Dash | Musa Abdullahi Dawud | Shantipriya Parida | Shamsuddeen Muhammad | Ibrahim Sa’id Ahmad | Subhadarshi Panda | Ondřej Bojar | Bashir Shehu Galadanci | Bello Shehu Bello
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Multi-modal Machine Translation (MMT) enables the use of visual information to enhance translation quality, especially where standard machine translation lacks the full context needed for unambiguous translation. Despite the increasing popularity of this technique, the field lacks sufficient high-quality datasets to realize its full potential. Hausa, a Chadic language, is a member of the Afro-Asiatic language family. It is estimated that about 100 to 150 million people speak the language, with more than 80 million indigenous speakers, more than any other Chadic language. Despite this large number of speakers, Hausa is considered a low-resource language in natural language processing (NLP), owing to the scarcity of resources for implementing most NLP tasks. While some datasets exist, they are either scarce, machine-generated, or limited to the religious domain. There is therefore a need to create training and evaluation data for implementing machine learning tasks and bridging the research gap in the language. This work presents the Hausa Visual Genome (HaVG), a dataset that contains the description of an image, or of a section within the image, in Hausa and its equivalent in English. The dataset was prepared by automatically translating the English descriptions of the images in the Hindi Visual Genome (HVG). The synthetic Hausa data was then carefully post-edited with reference to the respective images. The dataset comprises 32,923 images and their descriptions, divided into training, development, test, and challenge test sets. The Hausa Visual Genome is the first dataset of its kind and can be used for Hausa-English machine translation, multi-modal research, and image description, among various other natural language processing and generation tasks.
Universal Dependency Treebank for Odia Language
Shantipriya Parida | Kalyanamalini Shabadi | Atul Kr. Ojha | Saraswati Sahoo | Satya Ranjan Dash | Bijayalaxmi Dash
Proceedings of the WILDRE-6 Workshop within the 13th Language Resources and Evaluation Conference
This paper presents the first publicly available treebank of Odia, a morphologically rich, low-resource Indian language. The treebank contains approximately 1,082 tokens (100 sentences) in Odia, selected from “Samantar”, the largest available parallel corpus collection for Indic languages. All the selected sentences are manually annotated following the “Universal Dependency” guidelines. The morphological analysis of the Odia treebank was performed using machine learning techniques. The annotated treebank will enrich Odia language resources and help in building language technology tools for cross-lingual learning and typological research. We also build a preliminary Odia parser using a machine learning approach. The parser achieves an accuracy of 86.6% on tokenization, 64.1% UPOS, 63.78% XPOS, 42.04% UAS, and 21.34% LAS. Finally, the paper briefly discusses the linguistic analysis of the Odia UD treebank.
2021
Multimodal Neural Machine Translation System for English to Bengali
Shantipriya Parida | Subhadarshi Panda | Satya Prakash Biswal | Ketan Kotwal | Arghyadeep Sen | Satya Ranjan Dash | Petr Motlicek
Proceedings of the First Workshop on Multimodal Machine Translation for Low Resource Languages (MMTLRL 2021)
Multimodal Machine Translation (MMT) systems utilize additional information from modalities beyond text, typically images, to improve the quality of machine translation (MT). Despite proven advantages, it is difficult to develop an MMT system for many languages, primarily due to the lack of suitable multimodal datasets. In this work, we develop an MMT system for English→Bengali using the recently published Bengali Visual Genome (BVG) dataset, which contains images with associated bilingual textual descriptions. Through a comparative study of the developed MMT system vis-à-vis a text-to-text translation system, we demonstrate that the use of multimodal data not only improves translation performance, with BLEU score gains of +1.3 on the development set, +3.9 on the evaluation test set, and +0.9 on the challenge test set, but also helps resolve ambiguities in the pure-text descriptions. To the best of our knowledge, our English-Bengali MMT system is the first attempt in this direction and can thus act as a baseline for subsequent research on MMT for low-resource languages.
NLPHut’s Participation at WAT2021
Shantipriya Parida | Subhadarshi Panda | Ketan Kotwal | Amulya Ratna Dash | Satya Ranjan Dash | Yashvardhan Sharma | Petr Motlicek | Ondřej Bojar
Proceedings of the 8th Workshop on Asian Translation (WAT2021)
This paper describes the submissions of our team “NLPHut” to the shared tasks at WAT 2021. We participated in the English→Hindi Multimodal translation task, the English→Malayalam Multimodal translation task, and the Indic Multilingual translation task. We used the state-of-the-art Transformer model with language tags in different settings for the translation tasks, and proposed a novel “region-specific” caption generation approach, combining an image CNN with an LSTM, for Hindi and Malayalam image captioning. Our submission tops the English→Malayalam Multimodal translation task (text-only translation and Malayalam caption generation) and ranks second in the English→Hindi Multimodal translation task (text-only translation and Hindi caption generation). Our submissions also performed well in the Indic Multilingual translation task.
2020
ODIANLP’s Participation in WAT2020
Shantipriya Parida | Petr Motlicek | Amulya Ratna Dash | Satya Ranjan Dash | Debasish Kumar Mallick | Satya Prakash Biswal | Priyanka Pattnaik | Biranchi Narayan Nayak | Ondřej Bojar
Proceedings of the 7th Workshop on Asian Translation
This paper describes the ODIANLP submission to WAT 2020. We participated in the English-Hindi Multimodal task and the Indic task. We used the state-of-the-art Transformer model for the translation tasks and InceptionResNetV2 for the Hindi image captioning task. Our submissions top the English→Hindi Multimodal task in its track as well as the Odia↔English translation tasks, and also performed well in the Indic Multilingual tasks.
OdiEnCorp 2.0: Odia-English Parallel Corpus for Machine Translation
Shantipriya Parida | Satya Ranjan Dash | Ondřej Bojar | Petr Motlicek | Priyanka Pattnaik | Debasish Kumar Mallick
Proceedings of the WILDRE5 – 5th Workshop on Indian Language Data: Resources and Evaluation
The preparation of parallel corpora is a challenging task, particularly for languages that are under-represented in the digital world. In a multilingual country like India, the need for such parallel corpora is pressing for several low-resource languages. In this work, we provide an extended English-Odia parallel corpus, OdiEnCorp 2.0, aimed particularly at Neural Machine Translation (NMT) systems for English↔Odia translation. OdiEnCorp 2.0 builds on existing English-Odia corpora, which we extended through several additional methods of data acquisition: scraping parallel data from many websites, including Odia Wikipedia, and using optical character recognition (OCR) to extract parallel data from scanned images. Our OCR-based data extraction approach for building a parallel corpus is suitable for other low-resource languages that lack online content. The resulting OdiEnCorp 2.0 contains 98,302 sentences, with 1.69 million English tokens and 1.47 million Odia tokens. To the best of our knowledge, OdiEnCorp 2.0 is the largest Odia-English parallel corpus covering different domains, and it is freely available for non-commercial and research purposes.